Welcome to Part 2 of this series where we learn how to run compute shaders using DirectX 12. This time we will take the theory we discussed in Part 1 and put it into practice.

In this chapter, we’re going to write our first compute pipeline, which is able to present screen-space coordinates. It might not look exciting yet, but know that it’s the foundation for most compute pipelines.

For example, with this result and a few extra minutes of work we can do all sorts of post-processing effects or compute any sort of two-dimensional buffer of data.

If you’re looking for something more exciting than just screen-space coordinates, know that in the next and final chapter we are going to make a small, interactive particle simulation that can handle millions of particles.

Now that we know what we’re going to work towards, let’s look at what we will need to make:

  • A ComputeBuffer class, which allows us to make a GPU resource that we can read from and write to via compute shaders.
  • A Compute Shader that outputs screen space coordinates to our assigned buffer.
  • A ComputePipeline class, which compiles our compute shader, and uses it to create a pipeline we can use with our command list.
  • A ComputeRenderingStage class, which encapsulates all our rendering commands and resources relevant to running a compute shader.

Overall, we will write around 200-300 lines of code, which will allow us to run Compute Shaders in a DirectX 12 environment.

For this chapter, I’ve written a small framework in Visual Studio 2022 that you can use to program along with. It comes with DirectX 12 initialized and a window we can render to. You can download it and write along with the article, or work within your own project/framework.

In case you want to use the framework it can be found here: https://github.com/stefanpgd/Compute-DirectX12-Tutorial.
Make sure to clone/download the ‘main’ branch, or download it as a zip directly from GitHub.

Inside the framework’s solution, you can find a filter/folder under both Source & Headers called ‘Tutorial’. Here you can find the files we will be working with. When running the solution you should have a window pop up with a black background.

Prerequisites

This article assumes you have a grasp of the following topics so that you can follow along:

  • A basic understanding of the theory behind compute shaders.
  • An understanding of the primary DirectX 12 components. Things like the Device, Root Signature, Descriptor Heap and Resource.
  • A basic understanding of how HLSL works, including shader registers & spaces.

If you aren’t familiar with the theory behind compute shaders, then maybe give the first part of this article series a (re)read.
If you aren’t familiar yet with the mentioned DirectX or HLSL topics, then I recommend following the first two chapters of Jeremiah van Oosten’s Learning DirectX 12 series.

With that in mind, let’s start with writing our own compute pipelines.


Writing the ComputeBuffer

For our shader to compute something, we first need some data to work with. This means that with DirectX we need to allocate a buffer of data on the GPU. These buffers will need to be different from our regular Constant Buffers or Shader Resources. Why?
Well, we need to be able to read & write into our buffer, which is something we normally can’t do.

To be able to have this read/write functionality, we need to mark that our buffer has ‘Unordered Access‘.
Unordered Access means that our buffer can be read/written to simultaneously by multiple threads without generating memory conflicts. This is exactly what we need since we won’t know which threads are busy with our buffer when the compute shader is running.

With this comes a new descriptor you might’ve seen before, the ‘Unordered Access View‘ or UAV.
A UAV has similar functionality to a Shader Resource View (SRV), but with the added benefit of read/write access.
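To make that difference concrete, here is a minimal sketch of how the two view types are created. This is not code we will use in the tutorial; it assumes you already have a device, an allocated texture resource, and CPU descriptor handles, and the format here is just an example value:

// Read-only access in shaders: Shader Resource View (SRV)
D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srvDesc.Texture2D.MipLevels = 1;
device->CreateShaderResourceView(texture.Get(), &srvDesc, srvHandle);

// Read & write access in shaders: Unordered Access View (UAV)
D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;
device->CreateUnorderedAccessView(texture.Get(), nullptr, &uavDesc, uavHandle);

Only the resource behind the UAV needs to be created with the unordered-access flag, which is exactly what our ComputeBuffer class will take care of.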

So, let’s write a class that uploads a two-dimensional buffer of data with unordered access.

ComputeBuffer.h
#pragma once
#include <d3dx12.h>
#include <wrl.h>
using namespace Microsoft::WRL;

class ComputeBuffer
{
public:
	  ComputeBuffer(unsigned int width, unsigned int height, DXGI_FORMAT format);

	  ID3D12Resource* GetAddress();
	  CD3DX12_GPU_DESCRIPTOR_HANDLE GetUAV();
	  
	  // ... //

Let’s look at the basic structure of our ComputeBuffer header. The constructor takes a width and height for our buffer, together with a DXGI_FORMAT argument that indicates the type of element that will be stored in the buffer.

We also have the functions GetAddress() to retrieve a pointer to our resource and GetUAV() so we can get our descriptor. Let’s look at what else we will need:

ComputeBuffer.h
	  CD3DX12_GPU_DESCRIPTOR_HANDLE GetUAV();
	  
private:
	  void AllocateDataOnGPU();
	  void CreateDescriptor();

private:
	  ComPtr<ID3D12Resource> buffer;
	  unsigned int uavIndex;

	  unsigned int width;
	  unsigned int height;
	  DXGI_FORMAT format;
	};

Here we defined some functions so that we can allocate our buffer on the GPU and create a descriptor for the resource. Lastly, we store the resource itself, the index of its UAV descriptor, and the variables describing the buffer’s dimensions.
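As a quick preview of how this class will be used later in the article (the command list and the exact dimensions here are placeholders), creating and binding such a buffer will look roughly like this:

// Hypothetical usage: a screen-sized buffer of RGBA8 elements,
// bound as a UAV to root parameter 0 of a compute pipeline.
ComputeBuffer* target = new ComputeBuffer(1920, 1080, DXGI_FORMAT_R8G8B8A8_UNORM);
commandList->SetComputeRootDescriptorTable(0, target->GetUAV());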

Now let’s take a look at the implementation:

ComputeBuffer.cpp
#include "Tutorial/ComputeBuffer.h"
#include "Graphics/DXAccess.h" // Part of the framework to retrieve things like Device or Descriptor Heap(s)
#include "Graphics/DXDescriptorHeap.h"

ComputeBuffer::ComputeBuffer(unsigned int width, unsigned int height, DXGI_FORMAT format) :
    width(width), height(height), format(format)
{
    AllocateDataOnGPU();
    CreateDescriptor();
}

ID3D12Resource* ComputeBuffer::GetAddress()
{
    return buffer.Get();
}

CD3DX12_GPU_DESCRIPTOR_HANDLE ComputeBuffer::GetUAV()
{
    DXDescriptorHeap* UAVHeap = DXAccess::GetDescriptorHeap(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    return UAVHeap->GetGPUHandleAt(uavIndex);
}

Most of this is still straightforward. In our constructor we first make sure to allocate our buffer on the GPU, followed directly by creating a descriptor for the resource. Then we define our getters, one to retrieve a pointer to our resource, another to get our GPU descriptor handle from the descriptor heap.

Now let’s look at how we can allocate a buffer with unordered access:

ComputeBuffer.cpp
void ComputeBuffer::AllocateDataOnGPU()
{
    // 1) Write a description of our resource //
    D3D12_RESOURCE_DESC bufferDescription = {};
    bufferDescription.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;
    
    bufferDescription.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    bufferDescription.Width = width;
    bufferDescription.Height = height;
    bufferDescription.Format = format;
    
    bufferDescription.MipLevels = 1;
    bufferDescription.DepthOrArraySize = 1;
    bufferDescription.SampleDesc.Count = 1;
    bufferDescription.Layout = D3D12_TEXTURE_LAYOUT_UNKNOWN;
    
    // 2) Allocate our buffer with the device using the resource description //
    CD3DX12_HEAP_PROPERTIES defaultHeap = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
    
    DXAccess::GetDevice()->CreateCommittedResource(&defaultHeap, D3D12_HEAP_FLAG_NONE,     
        &bufferDescription, D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&buffer));
}

We first start with defining a D3D12_RESOURCE_DESC, starting with the flag that gives our resource unordered access. Next, we set the dimensions of our buffer, using the parameters we have passed in the constructor. Then at the end of our function, we create a committed resource on the default heap.

The only thing left is to create a descriptor for our resource. We need this so we can bind our buffer to our compute pipeline later on in this article.

ComputeBuffer.cpp
void ComputeBuffer::CreateDescriptor()
{
    DXDescriptorHeap* UAVHeap = DXAccess::GetDescriptorHeap(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    
    // Create UAV //
    D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = format;
    uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;
    
    uavIndex = UAVHeap->GetNextAvailableIndex();
    DXAccess::GetDevice()->CreateUnorderedAccessView(buffer.Get(), nullptr, &uavDesc, UAVHeap->GetCPUHandleAt(uavIndex));
}

Writing the Compute Shader

Now that we have our compute buffer, we can start writing our compute shader. If you are using the provided framework, you can find a shader called ‘hello.compute.hlsl’ in the ‘Source>Shaders’ folder. But let’s quickly showcase what to do when creating a new shader using Visual Studio 2022.

You can create a new shader file by right-clicking on any folder or filter in your Solution and selecting ‘Add New Item’, then navigate to the appropriate HLSL tab and select ‘Compute Shader File’ (if this tab doesn’t appear, make sure to have the ‘Game development with C++’ package installed via the Visual Studio Installer).

After creating it, right-click your shader file in your solution explorer and go to ‘Properties’. Under ‘HLSL Compiler>General’ you can find your Shader Type and Shader Model. Make sure these are set to ‘Compute Shader (/cs)’ and ‘Shader Model 5.1’ (or a higher/newer version).

Now that we have our file, let’s quickly go over the purpose of our shader. We’re going to receive a two-dimensional buffer of a given width & height. For every element of this buffer, we want a single thread to output a float4 containing the screen-space coordinates. Let’s see how we could do this:

hello.compute.hlsl
RWTexture2D<float4> buffer : register(u0);

[numthreads(8, 8, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint width;
    uint height;
    buffer.GetDimensions(width, height);
    
    float2 uv = dispatchThreadID.xy / float2(width, height);
    buffer[dispatchThreadID.xy] = float4(uv.xy, 0.0f, 1.0f);
}

The first line in our shader describes our unordered access buffer. The RWTexture2D type is probably new to you. The ‘RW‘ stands for ‘Read Write’, indicating that this buffer has unordered access. We register this resource to slot u0. The ‘u‘ shader registers are meant for UAVs.

After that, we determine the number of threads in our thread group. For this example, I went with 64 threads (8 × 8 × 1). One of the reasons is that thread groups of size [1, 1, 1] are possible but not recommended for efficiency/concurrency reasons.

GPU vendors have suggested thread group sizes that work more efficiently with their architecture. Nvidia’s suggested size for thread groups is 32, and for AMD it’s usually 64 (or 32, depending on driver settings with the newer RDNA architecture). But don’t feel fully restricted by these suggestions; if you need more or fewer threads, feel free to adjust to your needs.

The code that’s left is mostly simple. We know that every thread is working on a single pixel in our buffer. So, for every thread we want to calculate a UV which is based on its location relative to the buffer’s dimensions. Then we write these UV/screen-space coordinates into our buffer. As you might have noticed, we can directly write into our buffer similarly to how you would index into an array.

And that’s everything we need to do for our first compute shader. Now let’s look at how we can compile it.


Writing the ComputePipeline

The compute pipeline combines a root signature with a compute shader file into a Pipeline State Object. Luckily for us, making a compute pipeline is a lot simpler than a pipeline that does rasterization. Let’s look at our header file.

ComputePipeline.h
#pragma once
#include <string>
#include <d3dx12.h>
#include <wrl.h>
using namespace Microsoft::WRL;

class DXRootSignature;

class ComputePipeline
{
public:
	  ComputePipeline(DXRootSignature* rootSignature, const std::string& shaderFilePath);

	  ID3D12PipelineState* GetAddress();
	  
	  // ... //

For our constructor, we take in a root signature together with the relative path (from the solution or executable) to our shader. Then we also have our GetAddress() function again, allowing us to get a pointer to the Pipeline State object.

We also need two more functions. One is for compiling our shader and the other is for combining the root signature and shader (blob) into a pipeline.

ComputePipeline.h
    	ID3D12PipelineState* GetAddress();

private:
	  void CompileShaders(const std::string& shaderFilePath);
	  void CreatePipelineState(DXRootSignature* rootSignature);

private:
	  ComPtr<ID3D12PipelineState> pipeline;
	  ComPtr<ID3DBlob> computeShaderBlob;
};

Note that you don’t have to compile compute shaders at run time.
You can also use pre-compiled shaders (.cso) with some slight adjustments to the code (see Microsoft Documentation). I prefer to compile shaders at run time in personal projects since it gives me the option to recompile a shader while the program is still running.
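For reference, if you go the pre-compiled route, loading a .cso replaces the compile call entirely. A minimal sketch (the file path here is just a placeholder, and ThrowIfFailed is the same framework helper we use later in this class):

#include <d3dcompiler.h> // for D3DReadFileToBlob

// Load an already compiled shader blob from disk instead of compiling at run time.
ComPtr<ID3DBlob> computeShaderBlob;
ThrowIfFailed(D3DReadFileToBlob(L"hello.compute.cso", &computeShaderBlob));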

Now that the header is done, let’s look at the implementation:

ComputePipeline.cpp
#include "Tutorial/ComputePipeline.h"
#include "Graphics/DXUtilities.h"
#include "Graphics/DXRootSignature.h"
#include "Utilities/Logger.h"

#include <d3dcompiler.h>
#include <cassert>

ComputePipeline::ComputePipeline(DXRootSignature* rootSignature, const std::string& shaderFilePath)
{
	  CompileShaders(shaderFilePath);
	  CreatePipelineState(rootSignature);
}

ID3D12PipelineState* ComputePipeline::GetAddress()
{
	  return pipeline.Get();
}

Like before, we start with the constructor. We first need to compile our shaders before we can create the pipeline. We also provide a getter to retrieve a pointer to our pipeline state object.

Now let’s take a look at how we can compile compute shaders at run time:

ComputePipeline.cpp
void ComputePipeline::CompileShaders(const std::string& shaderFilePath)
{
    // 1) turn 'string' into 'wstring'
	  ComPtr<ID3DBlob> computeError;
	  std::wstring computeFilePath(shaderFilePath.begin(), shaderFilePath.end());

    // 2) Compile compute shader
	  D3DCompileFromFile(computeFilePath.c_str(), NULL, NULL, "main", "cs_5_1", 0, 0, &computeShaderBlob, &computeError);

    // 3) Check if there were any errors, if so assert & check the console for the error message
	  if (computeError != NULL)
    {
		  std::string buffer = std::string((char*)computeError->GetBufferPointer());
		  LOG(Log::MessageType::Error, buffer);
		  assert(false && "Compilation of shader failed, read console for errors.");
	  }
}

With the D3DCompiler library, we can compile our shader at run time using the D3DCompileFromFile function.
I want to focus on one parameter here: the target profile (pTarget), which in this case is “cs_5_1”. This identifier tells the compiler that we want to compile a ‘compute shader’ using shader model ‘5.1’ (the entry point, “main”, is passed separately). Make sure that whenever you use any other shader model, you adjust this identifier so that it matches.

We also passed an ID3DBlob called computeError. When this gets initialized, it means that an error or warning occurred while compiling the shader. When working on shaders, I recommend running the solution in Debug mode. In Debug mode, the shader compiler will also report warnings/errors relating to efficiency and other matters that would be ignored in Release mode.
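If you want the compiler itself to emit debug information and skip optimizations, you can pass compile flags through the two flag parameters (which are simply 0 in the code above). A small sketch of how that could look:

UINT compileFlags = 0;
#if defined(_DEBUG)
// Embed debug info and keep the generated code close to the HLSL source.
compileFlags = D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION;
#endif

D3DCompileFromFile(computeFilePath.c_str(), NULL, NULL, "main", "cs_5_1",
    compileFlags, 0, &computeShaderBlob, &computeError);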

Let’s look at the final section of our class where we create our pipeline:

ComputePipeline.cpp
void ComputePipeline::CreatePipelineState(DXRootSignature* rootSignature)
{
	  struct PipelineStateStream
	  {
		  CD3DX12_PIPELINE_STATE_STREAM_ROOT_SIGNATURE RootSignature;
		  CD3DX12_PIPELINE_STATE_STREAM_CS CS;
	  } PSS;

	  PSS.RootSignature = rootSignature->GetAddress();
	  PSS.CS = CD3DX12_SHADER_BYTECODE(computeShaderBlob.Get());

	  D3D12_PIPELINE_STATE_STREAM_DESC pssDescription = { sizeof(PSS), &PSS };
	  ThrowIfFailed(DXAccess::GetDevice()->CreatePipelineState(&pssDescription, IID_PPV_ARGS(&pipeline)));
}

If you have made pipeline state objects before, you will probably notice that a compute pipeline is (luckily) a lot less work. This is because our pipeline only has a single programmable stage, whereas a rasterization pipeline usually has many.
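If you prefer not to use the pipeline state stream, the same pipeline can also be created with the older D3D12_COMPUTE_PIPELINE_STATE_DESC approach. A roughly equivalent sketch:

// Alternative to the pipeline state stream used above.
D3D12_COMPUTE_PIPELINE_STATE_DESC computeDescription = {};
computeDescription.pRootSignature = rootSignature->GetAddress();
computeDescription.CS = CD3DX12_SHADER_BYTECODE(computeShaderBlob.Get());

ThrowIfFailed(DXAccess::GetDevice()->CreateComputePipelineState(&computeDescription, IID_PPV_ARGS(&pipeline)));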

With this done, we made all the components we needed. Now we can start using them, which brings us to the final section.


Writing the ComputeRenderingStage

The compute rendering stage is an encapsulation of our compute components and rendering code. When we have finished writing this class, we should have our first compute shader running.

Let’s look at what we are going to need:

ComputeRenderingStage.h
#pragma once
#include <d3dx12.h>
#include <wrl.h>
using namespace Microsoft::WRL;

class DXRootSignature;
class ComputePipeline;
class ComputeBuffer;

class ComputeRenderingStage
{
public:
	  ComputeRenderingStage();

	  void RecordStage(ComPtr<ID3D12GraphicsCommandList4> commandList);
	  
private:
	  void InitializeResources();
	  void InitializePipeline();

private:
	  DXRootSignature* computeRootSignature;
	  ComputePipeline* computePipeline;

	  ComputeBuffer* backBuffer;
};

As we can see, this class is again quite simple. It mostly consists of functions to initialize the components we’ve made so far, together with a function called RecordStage(...).

In the RecordStage(...) function, we will be recording our rendering commands into a received command list. In the provided framework I made sure that this function already gets called in the main rendering loop of the application. For reference, this is how that looks:

Renderer->Render()
  // ... //	
	
	// 3. Change the state of the renderTarget so we can render to it, then bind & clear it //
	TransitionResource(renderTarget.Get(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
	BindAndClearRenderTarget(window, &rtvHandle, nullptr);

	// 4. Record our compute pipeline(s) //
	computeRenderingStage->RecordStage(commandList);

	// 5. Render UI/ImGui and prepare render target for presenting //
	ImGui_ImplDX12_RenderDrawData(ImGui::GetDrawData(), commandList.Get());
	TransitionResource(renderTarget.Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
	
	// ... //

In case you aren’t using the framework, make sure to initialize the ComputeRenderingStage in a relevant place and to call RecordStage(...) in your rendering loop.

Knowing that, let’s look at the implementation of this class:

ComputeRenderingStage.cpp
#include "Tutorial/ComputeRenderingStage.h"
#include "Tutorial/ComputePipeline.h"
#include "Tutorial/ComputeBuffer.h"

#include "Graphics/DXRootSignature.h"
#include "Graphics/Window.h"
#include "Graphics/DXAccess.h"
#include "Graphics/DXUtilities.h"

ComputeRenderingStage::ComputeRenderingStage()
{
	  InitializeResources();
	  InitializePipeline();
}

void ComputeRenderingStage::InitializeResources()
{
	  unsigned int screenWidth = DXAccess::GetWindow()->GetWindowWidth();
	  unsigned int screenHeight = DXAccess::GetWindow()->GetWindowHeight();

	  backBuffer = new ComputeBuffer(screenWidth, screenHeight, DXGI_FORMAT_R8G8B8A8_UNORM);
}

We start by including the headers we’ve written so far, followed by some helper classes and components for rendering.

The InitializeResources() function will initialize a buffer that matches the size of our screen. We are also using the DXGI_FORMAT_R8G8B8A8_UNORM format, which is the same format that our swap chain uses. The reason these formats need to match will become relevant later on.
But first, let’s initialize our pipeline.

ComputeRenderingStage.cpp
void ComputeRenderingStage::InitializePipeline()
{
	  CD3DX12_DESCRIPTOR_RANGE1 backBufferRange[1];
	  backBufferRange[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0, 0);

	  CD3DX12_ROOT_PARAMETER1 pipelineParameters[1];
	  pipelineParameters[0].InitAsDescriptorTable(1, &backBufferRange[0]);

	  computeRootSignature = new DXRootSignature(pipelineParameters, _countof(pipelineParameters));
	  computePipeline = new ComputePipeline(computeRootSignature, "Source/Shaders/hello.compute.hlsl");
}

Before we can initialize our pipeline, we first have to make a root signature.
The root signature for our pipeline only needs to know about our back buffer. Knowing that, we only require a single root parameter.
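In case you aren’t using the framework’s DXRootSignature wrapper, an equivalent root signature can be created with plain DirectX 12 roughly like this (a sketch reusing the pipelineParameters array from above; ‘device’ stands for your ID3D12Device):

// Serialize and create a root signature with a single descriptor table.
CD3DX12_VERSIONED_ROOT_SIGNATURE_DESC rootDescription = {};
rootDescription.Init_1_1(_countof(pipelineParameters), pipelineParameters, 0, nullptr,
    D3D12_ROOT_SIGNATURE_FLAG_NONE);

ComPtr<ID3DBlob> serializedSignature;
ComPtr<ID3DBlob> serializeError;
ThrowIfFailed(D3DX12SerializeVersionedRootSignature(&rootDescription,
    D3D_ROOT_SIGNATURE_VERSION_1_1, &serializedSignature, &serializeError));

ComPtr<ID3D12RootSignature> rootSignature;
ThrowIfFailed(device->CreateRootSignature(0, serializedSignature->GetBufferPointer(),
    serializedSignature->GetBufferSize(), IID_PPV_ARGS(&rootSignature)));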

After we’ve initialized our root signature, we can immediately initialize our compute pipeline. With that done we can move to the final function of our tutorial:

ComputeRenderingStage.cpp
void ComputeRenderingStage::RecordStage(ComPtr<ID3D12GraphicsCommandList4> commandList)
{
	  // 1) Bind our root signature & pipeline //
	  commandList->SetComputeRootSignature(computeRootSignature->GetAddress());
	  commandList->SetPipelineState(computePipeline->GetAddress());

	  // 2) Bind resources needed for our pipeline //
	  commandList->SetComputeRootDescriptorTable(0, backBuffer->GetUAV());

	  // 3) Dispatch our compute shader //
	  unsigned int screenWidth = DXAccess::GetWindow()->GetWindowWidth();
	  unsigned int screenHeight = DXAccess::GetWindow()->GetWindowHeight();
	  unsigned int dispatchX = screenWidth / 8;
	  unsigned int dispatchY = screenHeight / 8;
	
	  commandList->Dispatch(dispatchX, dispatchY, 1);

	  // 4) Copy the result of our back buffer into the screen buffer //
	  ComPtr<ID3D12Resource> screenBuffer = DXAccess::GetWindow()->GetCurrentScreenBuffer();

	  TransitionResource(screenBuffer.Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_COPY_DEST);
	  commandList->CopyResource(screenBuffer.Get(), backBuffer->GetAddress());
	  TransitionResource(screenBuffer.Get(), D3D12_RESOURCE_STATE_COPY_DEST, D3D12_RESOURCE_STATE_RENDER_TARGET);
}

Let’s quickly summarize what’s happening:
First, we bind our root signature and pipeline state. This is similar to how we would normally bind a rasterization pipeline. But take note that we are using commandList->SetComputeRootSignature(..) here, the compute-specific counterpart of SetGraphicsRootSignature(..).

Afterward, we bind the UAV of our back buffer. Note that the functions for binding root parameters also have their own compute versions.
Then we grab the current screen size and divide it by the dimensions of our thread group (8 × 8). This way we dispatch just enough thread groups to cover the whole buffer, and not more than we need.
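Keep in mind that this integer division assumes the window dimensions are multiples of 8. If that isn’t guaranteed, a common approach (not shown in the framework’s code) is to round the group count up so the pixels at the edges are still covered:

// Round up instead of truncating, so a partially filled group still gets dispatched.
unsigned int dispatchX = (screenWidth + 7) / 8;
unsigned int dispatchY = (screenHeight + 7) / 8;

If you do round up, also add a small bounds check in the shader so threads that fall outside the buffer don’t write out of range.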

Then finally, we want to call commandList->CopyResource(..). This is a function that allows us to copy data from one GPU resource into another. This is possible whenever the dimensions, format and mipmap details match between the resources. The first parameter is the copy destination, and the second parameter is the source resource.

This is a quick way to present the results of a compute pipeline without having to set up a rasterization pipeline to render a screen quad.
Do note that we need to switch the render target’s resource state from D3D12_RESOURCE_STATE_RENDER_TARGET to D3D12_RESOURCE_STATE_COPY_DEST so that we can perform this operation.
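TransitionResource here is a small helper from the framework; if you aren’t using it, the same barriers can be recorded directly on the command list, roughly like this:

// Transition the screen buffer so it can act as a copy destination...
CD3DX12_RESOURCE_BARRIER toCopyDest = CD3DX12_RESOURCE_BARRIER::Transition(
    screenBuffer.Get(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_COPY_DEST);
commandList->ResourceBarrier(1, &toCopyDest);

commandList->CopyResource(screenBuffer.Get(), backBuffer->GetAddress());

// ...and transition it back so it can be rendered to and presented again.
CD3DX12_RESOURCE_BARRIER toRenderTarget = CD3DX12_RESOURCE_BARRIER::Transition(
    screenBuffer.Get(), D3D12_RESOURCE_STATE_COPY_DEST, D3D12_RESOURCE_STATE_RENDER_TARGET);
commandList->ResourceBarrier(1, &toRenderTarget);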


Closure

That’s it, well done! Hopefully, with all of this combined, your compute shader should run and present screen-space coordinates to the window. From here on out, you can slowly build on top of the framework or move the code over to your personal project.

In case you run into issues, or you just care about the end result, you can go to ‘Compute With DirectX 12 – Part 2 Code‘ to clone/download the version of the framework that contains the code we discussed.

If you are already comfortable with DirectX 12, it shouldn’t be too difficult to take these concepts and start making use of them. But, if you are looking for a bit more guidance, know that in the next and final chapter, we will expand these classes and make an interactive particle simulation.

In case you encounter problems, or have corrections, feel free to leave a comment or an issue on the GitHub repository.
Thanks for checking out my article. If you want to look out for future updates or are interested in what I’m up to, consider giving me a follow on Twitter/X.

Posted 21st of August, 2024
