Hey there! Let’s talk about compute shaders and how we can write our own with DirectX 12.

This will be a small series aimed at game and graphics programmers who want to know about the theory of compute shaders, and how to write compute pipelines & shaders in DirectX 12 with C++.

In this first chapter, we won’t delve into much code yet. Instead, we will discuss the general theory of parallel processing and compute shaders. We will talk about what compute shaders are, why they can be so powerful and some things to keep in mind while working with them.

If you are already familiar with these concepts and you just care about how to set up compute in DirectX 12, then check out Part 2 where we start delving into the implementation.

Particle Life with Compute Shaders
Prerequisites

For this chapter, having a grasp on the following will make it easier for you to follow along:

  • A basic understanding of a shader language like HLSL or GLSL.
  • Some experience with a graphics API like OpenGL or DirectX 12.

If these topics are still difficult or unknown to you, don’t worry. You will likely still be able to follow along with this chapter. But before moving on to future chapters where we talk about DirectX in more detail, I recommend taking a look at the first two chapters of Jeremiah van Oosten’s Learning DirectX 12 series.

In case you just want to play around with compute shaders without having to create your own framework, know that in the next chapter I will provide a small template to run compute shaders with!

Disclaimer
This series is meant as an introduction to compute shaders. This will likely not be the “optimal” way to use them. Instead, this will be a starting ground to work off from with further research. There are many in-depth resources on how to achieve different things like maximum efficiency using compute but we will only get into the fundamentals in this series.

With all of this in mind, let’s start!


What is a Compute Shader?

To understand the purpose of a compute shader, we first need to ask a simpler question: What is a shader?
Simply put, a shader is a small program that we can run on our graphics card thanks to Graphics APIs. These shaders are most commonly used for rendering.

You are likely already familiar with some shaders, like the vertex shader and pixel/fragment shader. Each of these has a given purpose:

  • The purpose of a vertex shader is to transform the geometry it receives to clip space.
  • The purpose of a pixel/fragment shader is to output a color to a render target.

So, what’s the purpose of a Compute Shader then?
That’s the great thing: it can do anything we want it to do. It has no predetermined purpose.
We as the (graphics) programmer get to determine the exact purpose of the shader. We get to decide the input and output data, together with how many threads should compute our code.

To rephrase that a little: a compute shader is a shader where we are allowed to compute any sort of (shader-)code, even when it’s unrelated to rendering.

A simplified N-Body particle simulation (1 million particles) using compute shaders

With this comes a whole lot of freedom and plenty of opportunities to use a compute shader. There is a great selection of things that compute shaders are useful for, which brings us to the next section.


Why would I even want to use a Compute Shader?

Through compute shaders we are able to run many complex algorithms and simulations very efficiently. Examples of such algorithms/simulations are:

  • Particle/Fluid simulations
  • Writing custom rendering code, e.g. post processing
  • Computing Radiance Cascades
  • Creating geometry, noise, textures etc.
  • Computing procedural generation algorithms
  • And much more!

But how? Well, our CPUs nowadays are amazingly fast, but sometimes we like to write code that even the mightiest of CPUs can’t handle efficiently. Code related to rendering is a good example of this. Transforming millions of polygons, rasterizing them, then shading the visible fragments and applying the results to a render target is not something that our CPU likes to do.

Luckily, we have our GPUs. As you might be aware, GPUs have a huge number of threads. Each of these threads is usually weaker than a CPU thread, but because so many cores are available, we can do many things at the same time. This is called parallel processing. With parallel processing, we can execute our rendering code at rapid speed.

Through compute shaders we can make use of parallel processing to run our code. Whenever we have a predetermined set of operations that needs to happen on a large chunk of data, where each piece of data can be computed independently of the others, we could consider moving that code into a compute shader to make use of the GPU’s threads.

Let’s try to make sense of this through an example:
Imagine we have a thousand particles in our scene and we want to update their positions. We first create a buffer of particles and update them through a loop. Every frame we want our particles to move in a given direction, in this instance upwards.

C++
Particle particles[1000];

/// Some update loop later in our code ///
for(Particle& particle : particles)
{
    particle.Position += vec3(0.0, 1.0, 0.0);
}

As you can imagine, this is not very efficient. This single thread has to do a thousand things every frame.

Now let’s take the same code and move it into a compute shader.
(If the code doesn’t make sense yet, don’t worry. We will get to that later)

HLSL
// Compute Shader code // 
RWStructuredBuffer<Particle> particles : register(u0);

[numthreads(1000, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    particles[dispatchThreadID.x].Position += float3(0.0, 1.0, 0.0);
}

What we see here is the same code moved into a compute shader. It looks quite similar in many places, but this code will likely be significantly faster. Why? Because now a thousand threads on the GPU compute this code at once, compared to a single thread doing a thousand tasks one after another.

The core premise of this example is to show that a GPU can let a thousand threads each do a single thing at the same time, while a CPU thread has to work through a thousand tasks on its own. This is why compute shaders are so fast: they run our code in parallel with ease!

This example is arbitrary, of course; a CPU could probably handle a thousand particles. But we aren’t limited to a thousand threads, either. We could compute millions of particles at the same time if we wanted. This is where we unlock the real power of our GPUs and parallel processing.
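To build a little intuition before we move to the GPU, the same “split the work across threads” idea can be sketched on the CPU with std::thread. This is purely an illustration: the Vec3/Particle structs and the UpdateParticles function here are made up for this sketch, not part of any framework we will use.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 Position; };

// Update every particle's position, splitting the buffer into independent
// chunks and giving each chunk to its own CPU thread. Every particle is
// independent of the others, which is exactly what makes this (and the
// compute shader version) safe to run in parallel.
void UpdateParticles(std::vector<Particle>& particles, unsigned int threadCount)
{
    std::vector<std::thread> workers;
    const size_t chunk = (particles.size() + threadCount - 1) / threadCount;

    for (unsigned int t = 0; t < threadCount; ++t)
    {
        const size_t begin = t * chunk;
        if (begin >= particles.size())
            break;
        const size_t end = std::min(begin + chunk, particles.size());

        workers.emplace_back([&particles, begin, end]
        {
            for (size_t i = begin; i < end; ++i)
                particles[i].Position.y += 1.0f; // move upwards, like before
        });
    }

    for (std::thread& worker : workers)
        worker.join();
}
```

A GPU takes this idea to the extreme: instead of a handful of worker threads each looping over a chunk, every particle simply gets its own thread.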

This concept was also beautifully demonstrated by MythBusters.

Some ‘drawbacks’ to Compute Shaders

Compute shaders are great, but they aren’t the solution to everything. Let’s highlight some of the things you need to keep in mind before moving your code over to the GPU:

1. Not all code fits parallel computing
The cores of our CPU are made to compute code sequentially. Thanks to the CPU’s architecture, it can handle branching and complex code well. Meanwhile, our GPU is designed for relatively simple computation, but with many cores. Shaders handle heavy branching poorly, and GPU threads can’t communicate with each other as freely as CPU threads can. Determining whether your code is suitable for parallel computing is an important step to take before you start.

2. Your resources need to be available on the GPU
If you want to compute code on the GPU, then your data also needs to be available on the GPU. At first glance this is likely not a problem, but it can be quite tricky when your CPU also needs the data. Uploading or reading back GPU resources can sometimes be an elaborate process. You will also need to allocate extra memory on the GPU, which in certain situations could be a problem. For most side projects, however, this won’t be an issue.

3. The data we send can be accessed by any thread
Whenever we want to compute some block of data, it’s usually fully accessible, meaning we can read and write it from any thread. Because of this, we should work with caution. Indexing the wrong data, or having multiple threads write to the same element, can lead to race conditions and undefined behaviour.
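The classic failure mode is several threads writing to the same element at once. This CPU sketch shows the safe pattern: when multiple threads genuinely must touch shared data, use atomic operations (HLSL offers equivalents such as InterlockedAdd). The function here is made up for illustration:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Many threads incrementing one shared counter. With a plain int this would
// be a data race and the final value would be unpredictable; std::atomic
// makes each increment indivisible, much like InterlockedAdd does in HLSL.
int CountInParallel(unsigned int threadCount, int incrementsPerThread)
{
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;

    for (unsigned int t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&counter, incrementsPerThread]
        {
            for (int i = 0; i < incrementsPerThread; ++i)
                counter.fetch_add(1);
        });
    }

    for (std::thread& worker : workers)
        worker.join();

    return counter.load();
}
```

Atomics have a cost, though; the best compute shaders are usually structured so that each thread writes only to its own element and no synchronization is needed at all.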

4. Optimizing compute shaders can be tricky
For most side projects, this won’t be a real problem. But in case you want to squeeze everything out of your GPU, it can be tricky to achieve. Balancing your shaders for maximum thread occupancy can be quite a process to get right, and determining the right number of threads and thread groups for your shader is another thing to balance out. Making compute shaders truly efficient will require some further reading into the topic.


Threads, Threads and… more Threads

Another bit of theory we need to understand before we can start programming is the different levels of threads when it comes to compute.

The first thing we need to wrap our heads around is that threads are visualized as 3D blocks. The image might help you visualize this concept.

  • A ‘thread’ is something that runs our shader code.
  • A ‘thread group’ is defined at the top of our compute shader in the [numthreads(x, y, z)] section. It defines the number of threads in a single group that will execute our shader code.
  • The thread ‘dispatch’ is defined in our source/CPU code with the command list. With it, we tell the GPU how many thread groups should run the same shader.

Both the thread group and dispatch technically define a number of threads but there are some differences.

A thread group can be seen as “some set of work I want X amount of threads to work on”. You could summarize it as a work order for the GPU. The number of threads available to a thread group depends on the GPU and shader model version, but for most modern GPUs and shaders it will likely be limited to 1024 threads (Microsoft documentation). For example:

HLSL
// This is allowed, since: 32 x 32 x 1 = 1024
[numthreads(32, 32, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)

HLSL
// This is NOT allowed, since: 100 x 100 x 100 = 1 million threads 
[numthreads(100, 100, 100)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
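If you generate thread group sizes at runtime, or just want a sanity check while debugging, a small validation helper can catch these limits early. A minimal sketch, assuming shader model 5.0 and up (where X and Y are each capped at 1024, Z at 64, and the product at 1024):

```cpp
// Shader model 5.0+ limits for [numthreads(x, y, z)]:
// x, y <= 1024, z <= 64, and x * y * z <= 1024.
bool IsValidThreadGroupSize(unsigned int x, unsigned int y, unsigned int z)
{
    if (x == 0 || y == 0 || z == 0)
        return false;
    if (x > 1024 || y > 1024 || z > 64)
        return false;
    return x * y * z <= 1024;
}
```

The function name is just an assumption for this sketch; the limits themselves come from the Microsoft documentation referenced above.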

Then we have the dispatch. The dispatch can be seen as a “list of work orders to go through”. We tell our GPU that we want to execute our work order X * Y * Z times. Depending on your GPU, the number of in-flight work orders/threads can differ. We don’t know exactly in which order our GPU will execute the thread groups, but if your shader is set up in the right way, that shouldn’t matter.

The maximum number of thread groups per dispatch dimension is 65,535. This means that technically, we could have a dispatch with 65,535^3 thread groups, which… is a lot.

C++
// CPU Rendering Code //
// Example of how a dispatch could look like on the CPU 
// 1. Bind pipeline & root signature
commandList->SetComputeRootSignature(rootSignature->GetAddress());
commandList->SetPipelineState(computePipeline->GetAddress());

// 2. Bind relevant root arguments 
commandList->SetComputeRootDescriptorTable(0, targetTexture->GetUAV());

// 3. Execute compute shader
commandList->Dispatch(1024, 1024, 1);

So, there is a lot of room for flexibility based on your situation. We will show an example in a bit to make it all clear, but first we need to know one more thing.


DispatchThreadID, our most useful semantic

We now know how to allocate threads to compute our shader, but how do we put those threads to work? This is where the semantics for compute shaders come into play. Shader semantics allow us to ask the shader pipeline for some data. Within compute shaders we have a few semantics that will be useful to us. Let’s look at two of them:

  • ‘SV_GroupThreadID’ is the local 3D index of a thread within a thread group.
  • ‘SV_DispatchThreadID’ is the global 3D index of a thread within the dispatch.

Let’s start with GroupThreadID: every thread within a thread group has its own 3D index. On the right you can see an example of how it works.

We have a thread group with a size of (2, 2, 1), together with a dispatch that uses the same dimensions. As you can tell, the GroupThreadID shows the local index of a thread within every thread group, which can be useful in some situations.

But then we have our DispatchThreadID. As you can see, each thread now has an index that relates to its position in the dispatch. This is very useful, because now each thread has a unique index.

DispatchThreadID is the key to being able to let a single thread compute some section of data. Almost any 1D or 2D array/buffer can easily be accessed by using this semantic. It makes working with compute shaders a lot easier.
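For a 1D buffer that holds 2D data, the usual trick is to flatten the dispatch thread ID into a linear index (in HLSL: buffer[id.y * width + id.x]). The same arithmetic in C++ for illustration; the struct and function names are made up for this sketch:

```cpp
struct ThreadID2D { unsigned int x, y; };

// Flatten a 2D dispatch thread ID into a linear buffer index,
// mirroring the common HLSL pattern: buffer[id.y * width + id.x].
unsigned int Flatten2D(ThreadID2D id, unsigned int width)
{
    return id.y * width + id.x;
}

// The inverse: recover the 2D thread ID from a linear index.
ThreadID2D Unflatten2D(unsigned int index, unsigned int width)
{
    return { index % width, index / width };
}
```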

Let’s go through a final example that combines all the theory:
Let’s say we have a 2D texture with dimensions of 512×512, and our goal is to make all of its pixels black. With a compute shader we can easily do this:

HLSL
RWTexture2D<float4> targetTexture : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    targetTexture[dispatchThreadID.xy] = float4(0.0f, 0.0f, 0.0f, 1.0f);
}

C++
// CPU Rendering code //
// 1. Bind pipeline & root signature
commandList->SetComputeRootSignature(rootSignature->GetAddress());
commandList->SetPipelineState(computePipeline->GetAddress());

// 2. Bind relevant root arguments 
commandList->SetComputeRootDescriptorTable(0, targetTexture->GetUAV());

// 3. Execute compute shader
commandList->Dispatch(512, 512, 1);

What we have here is an example of how we could clear our target texture. Notice how we use DispatchThreadID to go over every pixel of the texture.
The size of our thread group is just a single thread. So, with a dispatch of size (512, 512, 1) we get 512 by 512 thread groups, which matches up with our texture size. This way we can set every pixel in our texture to black.

Of course, there are many other ways to do this. For example, most texture sizes are divisible by 8. So we could do the following:

C++
// Note that not all textures are always divisible by 8.
unsigned int dispatchX = targetTexture.Width() / 8;
unsigned int dispatchY = targetTexture.Height() / 8;

commandList->Dispatch(dispatchX, dispatchY, 1);

HLSL
[numthreads(8, 8, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    targetTexture[dispatchThreadID.xy] = float4(0.0f, 0.0f, 0.0f, 1.0f);
}
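One common way to handle textures that aren’t divisible by the group size is to round the dispatch count up, and let the shader skip out-of-range pixels with an early return. A small helper for the rounding, sketched here with a hypothetical name:

```cpp
// Round up so the dispatch always covers the whole texture, even when the
// dimensions aren't divisible by the thread group size (e.g. 8).
// Equivalent to ceil(size / groupSize) in integer arithmetic.
unsigned int DispatchCount(unsigned int size, unsigned int groupSize)
{
    return (size + groupSize - 1) / groupSize;
}
```

With [numthreads(8, 8, 1)], a 300×300 texture would then dispatch 38×38 groups; the threads that land outside the texture should simply return before writing anything.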

As you can see, there are many ways to approach the same problem by balancing the threads.
Do know that GPU vendors have suggested thread group sizes. Nvidia’s suggested multiple is 32 (the warp size), while for AMD it’s usually 64 (or 32, based on driver settings with the newer RDNA architecture).
But you aren’t limited to these sizes; if you need more or fewer, your shader will still work.


Closure

That’s it! This is most of the theory you need to know about compute shaders to fully begin implementing them. This theory applies to almost any Graphics API and most engines like Unity or Unreal. The setup per engine/API might be different, but the theory we discussed will mostly be the same.

Hopefully, this was already useful for you! Next up is Part 2 where we set up a compute pipeline.
In case you found this interesting consider following me on Twitter/X to stay up to date with new projects and articles I’m working on.

If you happen to have any questions or corrections, feel free to leave a comment.

Posted 13th of August, 2024