Mandela’s Library of Alexandria

Man working with Internet-in-a-Box home page

Internet-in-a-Box learning content examples

Quality Content

Internet-in-a-Box shows you the latest Content Packs
installable in the languages your community needs (from online
libraries like
Kiwix,
OER2Go,
Archive.org)
then takes care of all the downloading details for you!

See

Mexico’s live demo

and our

medical examples

used by clinics in Asia and Africa especially, as hosted by
Wikipedia.

Schools can also choose among

almost 40 powerful apps

for teachers and students — optionally with a complete LMS
(learning management system) like Kolibri, Moodle, Nextcloud,
Sugarizer or WordPress.

Two Haitian schoolgirls working on a laptop

Friendly Community

Internet-in-a-Box is a

community product

enabled by professional volunteers working

side-by-side

with schools, clinics and libraries around the world — and the

Wikipedia community

especially.

Thank you everyone for humbly being part of this

OFF.NETWORK

grassroots learning

movement
.

Please consider

how you too might assist

this epic effort.
It’s astonishing how far we’ve come since Internet-in-a-Box’s
original demo in 2013 — and how far we will go together,
If You Too Can Help!

Read More

GPU Open: Work graphs API – compute rasterizer learning sample

In this article we’re going to practice using the work graphs API, published by Microsoft® earlier. If you’re unfamiliar with the basic syntax of work graphs, there is a great introduction here on GPUOpen: GPU work graphs intro.

Work graphs are exceptionally good at expressing pipelines. If you think about implementing an algorithm on the GPU, you probably often approach it as one monolithic program run on as many GPU work units as possible. After all, splitting an algorithm into multiple phases with the current language and API has many downsides:

  • you have to process each of the phases sequentially – occupancy is poor in the ramp-up and drain of the work units
  • you have to write out all of the information between the phases to memory – eating into your shared memory bandwidth, as caches are unlikely to be able to hold the full data set
  • you have to conservatively allocate the worst case memory footprint on the CPU, as you have no allocation facility on the GPU which could already render the idea impractical
  • alternatively you have to sync with the GPU and read back information – occupancy of the entire chip becomes very poor

Whatever opportunity in workload reduction and shaping that is created by phasing an algorithm, needs to be worth more than the disadvantages caused by it.

The work graphs API is not just providing a syntax to express pipelines and graphs, it solves the aforementioned problems under the hood as well:

  • the nodes in a graph can run simultaneously to each other – caches become effective
  • memory for communication between nodes is allocated on the fly – peak memory consumption is greatly reduced

In this part of the series we’re going to pick a small sample algorithm, show the workload reducing and shaping opportunities arising from breaking it into stages, and how work graphs can maximize performance inherit therein.

Rasterizer

The small sample algorithm we’ve chosen for demonstration is a scanline rasterizer. To keep the scope simple, we picked a algorithm variation which gracefully deals with the edge case of near plane intersections and needs no triangle clipper. Furthermore, this sample isn’t meant to be an industrial-strength copy ‘n’ paste template, it’s meant to teach you about work graphs so many things unrelated to work graphs have been left to be solved, like edge tie-breaking and some precision inaccuracies or hierarchical-Z testing: Triangle Scan Conversion using 2D Homogeneous Coordinates

The algorithm can be described simplified with these steps:

  1. vertex fetch and transform
  2. calculating homogeneous space equations
  3. triangle formation and bounding box calculation
  4. bounding box scan loop over x&y
    1. inside triangle detection
    2. attribute interpolation
    3. attribute write out

It’s a small algorithm (~100 lines of code), that solves the problem of triangle rasterization well. We will simply refer to this algorithm as the rasterizer from this point onwards, and if we say bounding box it means a screen space AABB.

In the following, we design the algorithm, then we go into the implementation details, and finally we’ll see some results. In your own projects there’ll be quite some back and forth between experimentation and consideration but one good aspect of the new work graphs paradigm is that this feels very comfortable to do. Once you have a stable interface between the application and the shader program, you can mostly stay in the shader code and try different approaches.

Development

Let’s transform the rasterizer algorithm step-by-step towards an implementation which works well on a GPU and takes advantage of the capabilities of work graphs. Naturally we’re going to break the algorithm into smaller pieces, taking advantage of routing data between the algorithm stages freely, and cull redundant data away.

  1. A GPU is currently practically a SIMD processor. The width of the SIMD vector varies between implementations and is commonly between 16 and 64 lanes wide.The rasterizer can be vectorized easily. Each lane is responsible to process a different triangle. The scan loop needs to iterate for as long as a participating triangle use its parameters. For example, if you have two triangles with respective bounding boxes of 48×6 and 8×92, then the loop needs to iterate over the range of 48×92 to cover the area of both simultaneously:

    This is rather inefficient as almost all of the area union is empty and not useful to either triangle. It’s better to reduce the dimensionality of the shared loop from 2D to 1D. The two example triangle’s loop parameters become 288 and 736 respectively, which is much better than the prior resulting 4140 iterations:

    float BBStrideX = (BBMax.x - BBMin.x + 1);
    float BBStrideY = (BBMax.y - BBMin.y + 1);
    float BBCoverArea = BBStrideY * BBStrideX - 1.0f;
    
    float y = BBMin.y;
    float x = BBMin.x;
    
    while (BBCoverArea >= 0.0f)
    {
        x += 1.0f;
        y += x > BBMax.x;
        if (x > BBMax.x)
            x = BBMin.x;
    
    BBCoverArea -= 1.0f;
    
  2. While the first step is already a good improvement, it would be nice if we could improve it further still. Can we make it happen that the triangle’s areas are similar to each other? It would allow the participants of the shared loop over the SIMD vector to be more coherent, and we don’t encounter situations where a small triangle is processed together with a big triangle. For this we have to break the rasterizer into two phases:
    1. everything before the loop
    2. the loop itself

    Between these two phases, we can now look at all of the triangles, group them by area, and launch the loops with triangles which are of similar size. Normally, this would cause significant overhead (programming and executing) in the regular Compute API, but with the work graphs API this is fairly straightforward, as we can divert the output of the first phase into a collection of second phases.

    However, there remains a problematic detail, that is that triangles can have areas anywhere between 1 and the entire screen – which is ~8 million for 4K. We don’t want to specify that many variations of the scan loop. Additionally, the frequencies of triangle sizes fall off fairly fast for a real world scene, and you’d be pressed to find more than a couple of triangles occupying the large areas. Instead, we can bin the triangles into a more manageable set of area properties. For generality we should not base this on specific triangle frequencies, but rather choose an expected distribution. A logarithm is a good choice and base 2 is computable very quickly, which non-coincidental establishes a connection to information entropy.

    // Calculate rasterization area and bin
    float2 BBSize = max(BBMax - BBMin + float2(1.0f, 1.0f), float2(0.0f, 0.0f));
    float  BBArea = BBSize.x * BBSize.y;
    int    BBBin  = int(ceil(log2(BBArea)));
    

  3. To support 4K we’d have to provide 23 rasterization loops: 22.93 = log2(~8 mil.), although even with applying the logarithm there are going to be very few triangles in the last bins. Additionally, for generality, we wouldn’t like a limit on the triangle’s sizes anyhow and as one additional worry: a shader program running a loop for 8 mil. iterations is extremely inefficient on a parallel processor like the GPU. What we can do instead is break up larger areas into smaller ones. This may sound familiar because this is what the hardware rasterizer also does. It subdivides screen space into fixed sized tiles. If a large triangle is processed, it distributes the same triangle to that many tiles. It allows the hardware to parallelize triangle rasterization at a modest cost. This idea provides us with three benefits:
    1. unlimited triangle sizes
    2. fewer loop variations → bins → nodes
    3. smaller loop iteration counts → higher parallelization → better


      occupancy

    There is one more indirect benefit emerging which is that we don’t need to rasterize empty bounding boxes. Once we chop up the original triangle’s bounding box into smaller ones then some of the smaller bounding boxes now might reside entirely out of the original triangle’s area. A box is a rather poor approximation for a triangle: essentially half of the box’s area is empty space. If our maximally binneable area is small compared to the incoming area, then we can cull a large amount of iterations which don’t do anything except fail to qualify for the triangle’s interior.

    1. determine bounding box subdivision factors in x&y regarding area limits
    2. loop over bounding box
      1. calculate sub bounding box
      2. test sub bounding box corners for inside triangle
      3. bin surviving sub bounding boxes

    This again, can be parallelized by passing the subdivision factors directly to a Dispatch, much the same as the hardware does.

  4. Now we come to our first work graphs API specific optimization. You can launch a Broadcasting node with dynamic Dispatch parameters:
    struct SplitRecord
    {
        uint3 DispatchGrid : SV_DispatchGrid;
    

    and

    [Shader("node")]
    [NodeLaunch("broadcasting")]
    [NodeMaxDispatchGrid(240, 135, 1)] // Enough of a limit to allow practically unlimited sized triangles
    [NumThreads(SPLITR_NUM_THREADS, 1, 1)]
    void SplitDynamic(
    

    or you can used fixed Dispatch parameters:

    [Shader("node")]
    [NodeLaunch("broadcasting")]
    [NodeDispatchGrid(1, 1, 1)] // Fixed size of exactly 1 work group
    [NumThreads(SPLITR_NUM_THREADS, 1, 1)]
    void SplitFixed(
    

    The fixed Dispatch parameter variation is much easier to optimize by the scheduler than the dynamic Dispatch variation. Consider we’re launching 2x Dispatches X,1,1 and X,1,1 and we don’t use Y and Z as in the following “uint WorkIndex : SV_DispatchThreadID”.

    For a regular Compute Dispatch and for a work graphs dynamic Dispatch these are unknown and variable Xs, and the scheduler together with the allocator need to prepare for these unknowns. But a work graphs fixed Dispatch knows it’s X is always the same, and it knows you’re not using Y and Z. Because of this the scheduler can merge the arguments into one X,2,1 and launch it as one dispatch.

    We can implement this in work graphs as a branch. If the bounding box needs to be tiled into less than 32 fragments we use the fixed Dispatch, otherwise we use the dynamic Dispatch. The key for this to be effective is again triangle size distribution, smaller triangles should be more frequent, and very large triangles should be rare.

Work graph scheme

This concludes the design of our rasterizer taking advantage of the work graphs API paradigm. Part of the thought process is not dissimilar to pipelining an algorithm or multi-threading a larger sequential process. Take advantage of the possibility to make the data associated with a SIMD vector more coherent if needed to achieve better performance. Or consider breaking an algorithm apart at locations where it branches out. This can eliminate dead lanes from the active working set, or recover utilization for conditional code-fragments. As the case for other fine-grained multi-threading practices, try to keep the cost of computations in nodes, if not very similar, then as short as you can. It leads to better occupancy and allows your allocations to be reused efficiently.

This is of course is not a exhaustive set of possible improvements. Among other things you could experiment with:

  • vertex sharing
  • full vs. partially filled bounding boxes
  • small vs. large rasterization loop specializations – the one pixel bin doesn’t need a loop at all
  • clamping subdivision to at least a wave-size to maximize parallelism
  • and more

Below is the resulting explicit algorithm graph which we’ll subsequently implement:

Data structure

The basic data structure we’re using to provide the scene data for the algorithm is as follows:

struct Workload
{
    uint offsetIndex;  // local cluster location in the global shared index buffer
    uint offsetVertex; // local cluster location in the global shared vertex buffer
	                   // (indices are cluster local)
    uint startIndex;   // start of the triangle-range in the index buffer
    uint stopIndex;    // end of the triangle range in the index buffer, inclusive
    
    float4 vCenter;
    float4 vRadius;
    
    matrix mWorldCurrent;
    matrix mWorldPrevious;
};

The Workload item describes a single object to be rasterized. The indices and vertices are all in one continuously accessible buffer, because it allows us to to get away with the traditional binding paradigm of DirectX, binding just two buffer descriptors. You can opt for a bindless paradigm as well, and use descriptor indices instead of global buffer indices.

Apart from that, we also pass the object’s position and extent, as well as its current world matrix. For static objects we’d only need to transmit this buffer once, as none of the information changes over time, but for this project the data is small enough to be transferred every frame and we don’t complicate things further with the need to disambiguate dynamic objects.

We conduct frustum culling on the CPU, and write the active scene elements as an array of Workload items, then we launch the work graph.

Work graph implementation

HLSL

Nodes

Looking at the graph above, we need to implement three different stages from an algorithmic perspective:

  • triangle fetch and transform + bounding box calculation + binning
  • bounding box subdivision and culling + binning
  • bounding box rasterization

For the subdivision we have two launch signatures (fixed vs. dynamic), but the algorithm in the body of the nodes is the same. Similarly with rasterization, while we have k different bins, it just groups the data to be more coherent and doesn’t require us to have specialized implementations.

TriangleFetchAndTransform

This is the root node of the graph which is also the entry point of the program. It has the character of a regular Compute Dispatch in which we launch as many threads as there are triangles but share the object’s ID and parameters, hence we annotate it being of type Broadcasting.

[Shader("node")]
[NodeLaunch("Broadcasting")]
[NodeMaxDispatchGrid(65535, 1, 1)]
[NodeIsProgramEntry]
[NumThreads(TRANSF_NUM_THREADS, 1, 1)]
void TriangleFetchAndTransform(
    uint WorkgroupIndex : SV_GroupID,
    uint SIMDLaneIndex : SV_GroupIndex,

The node is fed by the CPU similar to an indirect argument buffer but we’ll get to that later in the C++ section.

Split

This is the node which amplifies bounding boxes. We start a dispatch-grid as large as the calculated sub-division, but share the encompassing origin bounding box and the triangle’s parameters. It’s also of Broadcasting type.

[Shader("node")]
[NodeLaunch("broadcasting")]
#if (EXPANSION == 1) // fixed expansion
[NodeDispatchGrid(1, 1, 1)] // has the size of one wave, consumes no
dispatch parameters
[NumThreads(SPLITR_NUM_THREADS, 1, 1)]
void SplitFixed(
#elif (EXPANSION == 2) // dynamic expansion
[NodeMaxDispatchGrid(240, 135, 1)] // can be very large, consumes
dispatch parameters
[NumThreads(SPLITR_NUM_THREADS, 1, 1)]
void SplitDynamic(
#else
[NodeMaxDispatchGrid(240, 135, 1)]
[NumThreads(SPLITR_NUM_THREADS, 1, 1)]
void Split(
#endif
    uint  WorkSIMDIndex : SV_DispatchThreadID, // use only one dimension to allow optimizations of the dispatch

To generate different signature (and function name) permutations we use preprocessor conditions and compile the node from the same code multiple times. We’re using different function signatures for fixed and dynamic nodes so they are not compatible/interchangeable with each other according to the current revision of the specification. Because of this, we are unable to use a binning/indexing mechanism to target them and have to explicitly relay data to one or the other.

Rasterize

The leaf node(s) of the graph are the rasterizers. Every thread works on its own bounding box and triangle, and there’s no useful information that could be shared between them. This node is of Thread type.

[Shader("node")]
[NodeLaunch("thread")]
[NodeID("Rasterize", TRIANGLE_BIN)] // indicate position in a node-array, passed in as a preprocessor symbol while compiling the node
void Rasterize(

A Thread launch node is not controlled by an explicit argument about the size of the working set. Instead the implementation operates a bit like a producer consumer queue. Once enough work has accumulated to launch a SIMD wave it will be launched with as many individual parameters as there are lanes. This is quite efficient, because the optimal number of launched treads can be chosen internally and we don’t need to have perfect knowledge of the conditions. For example, this value can be distinct between architectures, or even SKUs. Or sometimes it’s just better for the scheduler to kick out work to prevent stalls, recover allocations, or to prime a shader program on a SIMD unit.

Because the rasterizers are all identical in terms of code, we can use work graph’s powerful concept of node-arrays here. Instead of redirecting work to nodes explicitly via conditional constructs (if/else, switch, …), this concept works similar to jump tables.

Data passing

What’s left to do is defining the connections between the nodes. This is specified explicitly in a top down fashion. The outputs point explicitly to the nodes they target. To ensure these connections are valid, both inputs and outputs specify which data structure they are carrying so that the runtime can verify that the connection is a valid one. The result is a directed acyclic graph.

The declarations of inputs and outputs are part of the function parameter declaration, including their local annotations. This can make work graph function signatures really big. In the following we will only show the particular part of the function signature related to the data record we’re explaining. In the code you’ll see them all together in one big block.

Application → TriangleFetchAndTransform

First we define how we feed data from the application into the graphs root node. Because the root node is of Broadcasting type, we use the required SV_DispatchGrid semantics in the data declaration. The workload-offset pointing to the global workload buffer is similar to a root signature constant – it’s the same over all invocations.

struct DrawRecord
{
    // index into the global scene buffer containing all Workload descriptors
    uint  workloadOffset;
    
    uint3 DispatchGrid : SV_DispatchGrid;
};

Now we declare this struct to be the type used for inputs in our root node function signature:

void TriangleFetchAndTransform(
    // Input record that contains the dispatch grid size.
    // Set up by the application.
    DispatchNodeInputRecord<DrawRecord> launchRecord,

When we combine dispatch global parameters with dispatch local parameters, we can know exactly which triangle every thread is suppose to work on.

TriangleFetchAndTransform → SplitFixed/SplitDynamic

Next, we define how we pass data from our root node into the bounding box subdivision nodes. Because we use two different node types for this (a fixed Dispatch and a dynamic Dispatch, with incompatible function signatures), we also want to declare two different types of data structures. The dynamic Dispatch one has to use the required SV_DispatchGrid semantics in the data declaration. The fixed Dispatch could also use one, because the compiler will silently clear out unused semantic parameters but we keep two versions for clarity and unambiguous compile time verification.

struct SplitDynamicRecord
{
    uint3 DispatchGrid : SV_DispatchGrid;
    
    // Triangle equation, bbox and interpolateable values
    struct TriangleStateStorage tri;
    
    uint2 DispatchDims; // 2D grid range processed by this 1D dispatch
    uint2 DispatchFrag; // 2D size of the subdividing bounding boxes
};
struct SplitFixedRecord
{
    // Triangle equation, bbox and interpolateable values
    struct TriangleStateStorage tri;
    
    uint2 DispatchDims; // 2D grid range processed by this 1D dispatch
    uint2 DispatchFrag; // 2D size of the subdividing bounding boxes
};

These two record types are then added to our root node’s function signature as individual outputs. For each output we have to annotate a couple of more things:

  • how much output data is our work-group potentially producing at most
  • what’s the name of the node receiving the output data

The former allows the allocator to switch allocation strategy depending on the situation and the latter establishes unmistakably where the data goes.

void TriangleFetchAndTransform(
    [MaxRecords(TRANSF_NUM_THREADS)] // N threads that each output 1 bbox-split
    [NodeID("SplitFixed")]
    NodeOutput splitFixedOutput,
    
    [MaxRecords(TRANSF_NUM_THREADS)] // N threads that each output 1 bbox-split
    [NodeID("SplitDynamic")]
    NodeOutput splitDynamicOutput,

Then we declare this struct to be the type used for inputs into our subdivision node function signature:

void SplitFixed/Dynamic(
    // Input record that may contain the dispatch grid size.
    // Set up by TriangleFetchAndTransform.
    #if (EXPANSION == 1) // fixed expansion
    DispatchNodeInputRecord<SplitFixedRecord> splitRecord,
    #else // dynamic expansion
    DispatchNodeInputRecord<SplitDynamicRecord> splitRecord,
    #endif

We don’t change the body of the function, as it’s doing the same operations and the parameters have been kept identical. Instead, we just specialize how it is called via the function signature.

TriangleFetchAndTransform and SplitFixed/SplitDynamic → Rasterize

Finally, we define how we pass data from all of the nodes higher up in the graph into the rasterizer. The only data that’s really necessary for the rasterization loop is the triangle equation, the bounding box to scan, and the values we would like to interpolate.

struct RasterizeRecord
{
    // Triangle equation, bbox and interpolateable values
    struct TriangleStateStorage tri;
};

Although, as we planned in the beginning of the article, we want to collect data with similar performance characteristics together. The native work graph construct to achieve this is the output array. So far we’ve only targeted individual outputs. Output-arrays can be understood as bundles of outputs bound together. Instead of flow control, we use an index to choose which one to output to. When there are only a couple of outputs it might not make much of a difference, but once there are a couple of hundred outputs involved, it becomes highly inefficient to branch. We specify the size of the output array in the annotations of the output together with output limits and output node name. Just like with multiple outputs, it’s not a requirement to select just a single option and feed data mutually exclusive. If an algorithm requires you to feed multiple outputs (or multiple slots in the output-array) with data that’s perfectly fine. Just remember to adjust your peak output size.

void TriangleFetchAndTransform(
    [MaxRecords(TRANSF_NUM_THREADS)] // N threads that each output 1 triangle
    [NodeID("Rasterize")]
    [NodeArraySize(TRIANGLE_BINS)]  // TRIANGLE_BINS defined by application
                                    // during compilation
    NodeOutputArray triangleOutput

and:

void SplitFixed/Dynamic(
    [MaxRecords(SPLITR_NUM_THREADS)] // N threads that each output 1 triangle
    [NodeID("Rasterize")]
    [NodeArraySize(TRIANGLE_BINS)]  // TRIANGLE_BINS defined by application
	                                // during compilation
    NodeOutputArray triangleOutput

Although we may arrive at the rasterization node by different paths, the structure is the same, and the declaration very straightforward:

void Rasterize(
    ThreadNodeInputRecord<RasterizeRecord> triangleRecord

Just like in a regular Compute algorithm, we’ll write out rasterized outputs to a UAV which doesn’t need a special declaration in work graphs.

Data processing

With the inputs and outputs declared and ready to be used, we can implement the I/O. It’s fairly straightforward so we will not go through all instances.

If you have single input, as in the case for Broadcasting and Thread), we can call a simple getter on the declared input:

uint workloadOffset = launchRecord.Get().workloadOffset;

That’s all there is to it. For Coalescing launches you’d just need to pass the index of the element you want to retrieve to the getter.

Outputting is slightly more complex. As dictated by the specification, you have to call the output record allocator in a coherent manner (all lanes leave, or none). To allow lane individual allocation, the API accepts a boolean to indicate if you really want to have a record, or if it’s a dummy allocation.

Let’s take a look at the outputting code of the root node:

const uint triangleBin                             = ts.triangleBBBin; // calculated bin, if not too large
const bool allocateRasterRecordForThisThread       = ts.triangleValid; // true if not too large
const bool allocateSplitDynamicRecordForThisThread = !allocateRasterRecordForThisThread & (DispatchGrid.x != 1); // too large, and many work groups necessary for subdivision
const bool allocateSplitFixedRecordForThisThread   = !allocateRasterRecordForThisThread & (DispatchGrid.x == 1); // too large, but only one work group necessary for subdivision

// call allocators outside of branch scopes, the boolean indicates if we want a real record or not
ThreadNodeOutputRecords<RasterizeRecord> rasterRecord =
    triangleOutput[triangleBin]
	    .GetThreadNodeOutputRecords(allocateRasterRecordForThisThread);
ThreadNodeOutputRecords<SplitDynamicRecord> splitterDynamicRecord =
    splitDynamicOutput
	    .GetThreadNodeOutputRecords(allocateSplitDynamicRecordForThisThread);
ThreadNodeOutputRecords<SplitFixedRecord> splitterFixedRecord =
    splitFixedOutput
	    .GetThreadNodeOutputRecords(allocateSplitFixedRecordForThisThread);

// fill the acquired output records, here we branch according to the conditions
if (allocateRasterRecordForThisThread)
{
    rasterRecord.Get().tri = StoreTriangleState(ts);
}
else if (allocateSplitFixedRecordForThisThread)
{
    splitterFixedRecord.Get().DispatchDims = DispatchDims;
    splitterFixedRecord.Get().DispatchFrag = DispatchFrag;
    splitterFixedRecord.Get().tri = StoreTriangleState(ts);
}
else if (allocateSplitDynamicRecordForThisThread)
{
    splitterDynamicRecord.Get().DispatchGrid = DispatchGrid;
    splitterDynamicRecord.Get().DispatchDims = DispatchDims;
    splitterDynamicRecord.Get().DispatchFrag = DispatchFrag;
    splitterDynamicRecord.Get().tri = StoreTriangleState(ts);
}

// call completion outside of branch scopes, this allows the scheduler to start processing them
rasterRecord.OutputComplete();
splitterDynamicRecord.OutputComplete();
splitterFixedRecord.OutputComplete();

We have three possible output scenarios: 1. output directly to the rasterizer, using the bin as an index into the output array 2. output to the fixed Dispatch splitter without system value 3. output to the dynamic Dispatch splitter with system value

Because we’re taking care not to branch around these calls, it looks a bit verbose, but there’s not much complexity to it. The record you’ll get from the output allocator is of the same type as the input records that you receive through the function. You need to call the getter in the same way. If you allocate multiple entries (not in this project), then you pass the index of the sub-allocation record that you want to access, same as with the record collection given to you in a Coalescing launch.

With this we have everything work graphs-specific implemented in our shader code. In the next chapter we’re going to take a look how to kick off the work graph from the CPU-side.

C++ and Direct3D 12

Create the state object

Creating a work graph program in D3D12 is a bit different than for a regular Graphics or Compute pipeline State Object. Schematically it’s similar to creating a State Object for raytracing for which you also register a large amount of sub-programs.

Firstly, we’ll define the description of our work graph and specify which is the entry point/root node:

D3D12_NODE_ID entryPointId = {
    .Name = L"TriangleFetchAndTransform", // name of entry-point/root node
    .ArrayIndex = 0,
};

D3D12_WORK_GRAPH_DESC workGraphDesc = {
    .ProgramName = L"WorkGraphRasterization",
    .Flags = D3D12_WORK_GRAPH_FLAG_INCLUDE_ALL_AVAILABLE_NODES,
    .NumEntrypoints = 1,
    .pEntrypoints = &entryPointId,
    .NumExplicitlyDefinedNodes = 0,
    .pExplicitlyDefinedNodes = nullptr
};

Then we compile all of our nodes into individual shader blobs, adjusting preprocessor symbols to receive exactly the permutation we want in-between:

CompileShaderFromFile(sourceWG, &defines, 
    "TriangleFetchAndTransform", "-T lib_6_8" OPT, &shaderComputeFrontend);

CompileShaderFromFile(sourceWG, &defines, 
    "SplitDynamic", "-T lib_6_8" OPT, &shaderComputeSplitter[0]);
CompileShaderFromFile(sourceWG, &defines, 
    "SplitFixed", "-T lib_6_8" OPT, &shaderComputeSplitter[1]);

for (int triangleBin = 0; triangleBin < triangleBins; ++triangleBin)
    CompileShaderFromFile(sourceWG, &defines, 
        "Rasterize", "-T lib_6_8" OPT, &shaderComputeRasterize[triangleBin);

The function names we specify here have to match the ones in the HLSL source code, so that the compiler knows which part of the source we want. You can see here that for the splitter we can use explicit function names, because we switch their names in the HLSL. For the rasterize function, we won’t do this, because it’s impractical for large numbers of nodes. Remember we have potentially hundreds of them!

Instead, we use a neat feature of the API: changing function names when linking the program, but more on that in a bit.

After we have compiled all of the individual shader blobs, we can register them as sub-objects for the work graphs program:

std::vector<D3D12_STATE_SUBOBJECT> subObjects = {
    {.Type = D3D12_STATE_SUBOBJECT_TYPE_GLOBAL_ROOT_SIGNATURE,
	 .pDesc = &globalRootSignature},
    {.Type = D3D12_STATE_SUBOBJECT_TYPE_WORK_GRAPH,
	 .pDesc = &workGraphDesc},
};

D3D12_DXIL_LIBRARY_DESC shaderComputeFrontendDXIL = {
    .DXILLibrary = shaderComputeFrontend,
    .NumExports = 0,
    .pExports = nullptr
};

subObjects.emplace_back(D3D12_STATE_SUBOBJECT_TYPE_DXIL_LIBRARY, &shaderComputeFrontendDXIL); // move into persistent allocation

D3D12_DXIL_LIBRARY_DESC shaderComputeSplitter0DXIL = {
    .DXILLibrary = shaderComputeSplitter[0],
    .NumExports = 0,
    .pExports = nullptr
};

subObjects.emplace_back(D3D12_STATE_SUBOBJECT_TYPE_DXIL_LIBRARY, &shaderComputeSplitter0DXIL); // move into persistent allocation

D3D12_DXIL_LIBRARY_DESC shaderComputeSplitter1DXIL = {
    .DXILLibrary = shaderComputeSplitter[1],
    .NumExports = 0,
    .pExports = nullptr
};

subObjects.emplace_back(D3D12_STATE_SUBOBJECT_TYPE_DXIL_LIBRARY, &shaderComputeSplitter1DXIL); // move into persistent allocation

So far so good. Now we want to register the rasterizer shader blobs and those we have to rename. The linker will automatically match function names with the output array declaration that we gave in HLSL:

[NodeID("Rasterize")]
NodeOutputArray<RasterizeRecord> triangleOutput

The pattern which the matcher follows is _. We will specify that we want the “Rasterize” to be renamed to “Rasterize_0” and so on in the following:

for (int triangleBin = 0; triangleBin < triangleBins; ++triangleBin)
{
    auto entryPoint = L"Rasterize";
    
    // Change "Rasterize" into "Rasterize_k"
    refNames.emplace_back(std::format(L"{}_{}", entryPoint, triangleBin)); // move into persistent allocation
    
    // Tell the linker to change the export name when linking
    D3D12_EXPORT_DESC shaderComputeRasterizeExport = {
        .Name = refNames.back().c_str(),
        .ExportToRename = entryPoint,
        .Flags = D3D12_EXPORT_FLAG_NONE
    };
    
    expNames.emplace_back(shaderComputeRasterizeExport); // move into persistent allocation
    
    D3D12_DXIL_LIBRARY_DESC shaderComputeRasterizeDXIL = {
        .DXILLibrary = shaderComputeRasterize[triangleBin],
        .NumExports = 1,
        .pExports = &expNames.back()
    };
    
    libNames.emplace_back(shaderComputeRasterizeDXIL); // move into persistent allocation
    
    subObjects.emplace_back(D3D12_STATE_SUBOBJECT_TYPE_DXIL_LIBRARY, &libNames.back());
}

Additionally there’s a bit of code to prevent dangling points from out-of-scope deallocations.

Now we have collected and defined all necessary information to create our work graphs program:

const D3D12_STATE_OBJECT_DESC stateObjectDesc = {
    .Type = D3D12_STATE_OBJECT_TYPE_EXECUTABLE,
    .NumSubobjects = static_cast<UINT>(subObjects.size()),
    .pSubobjects = subObjects.data()
};

ThrowIfFailed(m_pDevice->CreateStateObject(&stateObjectDesc, IID_PPV_ARGS(&m_pipelineWorkGraph)));

It’s certainly more complicated to set up than one of the predefined hardware pipeline stages (VS, PS, CS etc.) but it allows you to flexibly assemble a variety of different work graphs from common building blocks without the need to leak the specification of that variety into your HLSL code.

Allocate backing memory

As we said in the beginning, one of the beneficial capabilities of work graphs is that it allows you to make allocations to pass data from one node to the next but the memory pool from which the implementation draws it’s allocations from is not some hidden implicit one. After we have the work graphs State Object we have to find out how large that memory pool should be in total. We also have to allocate and provide a buffer to the implementation ourselves:

ComPtr<ID3D12WorkGraphProperties> workGraphProperties;
ThrowIfFailed(m_pipelineWorkGraph->QueryInterface(IID_PPV_ARGS(&workGraphProperties)));

D3D12_WORK_GRAPH_MEMORY_REQUIREMENTS workGraphMemoryRequirements;
workGraphProperties->GetWorkGraphMemoryRequirements(0, &workGraphMemoryRequirements);

const auto workGraphBackingMemoryResourceDesc = CD3DX12_RESOURCE_DESC::Buffer(workGraphMemoryRequirements.MaxSizeInBytes);

const auto defaultHeapProps = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);

ThrowIfFailed(m_pDevice->CreateCommittedResource(
    &defaultHeapProps,
    D3D12_HEAP_FLAG_NONE,
    &workGraphBackingMemoryResourceDesc,
    D3D12_RESOURCE_STATE_COMMON,
    nullptr,
    IID_PPV_ARGS(&m_memoryWorkGraph
)));

const D3D12_GPU_VIRTUAL_ADDRESS_RANGE backingMemoryAddrRange = {
    .StartAddress = m_memoryWorkGraph->GetGPUVirtualAddress(),
    .SizeInBytes = workGraphMemoryRequirements.MaxSizeInBytes,
};

The size of the backing memory depends on the structure of your Work graph. For example, how many outputs are there?, what are the types of the outputs? and so on. The memory size will be sufficient for the Work graph implementation to always have the ability to make forward progress.

Run the program

Finally, we are ready to start the work graph on the GPU. We have:

  • a buffer with DrawRecords which contains the inputs for our root-node
  • a backing memory to draw allocations from
  • a State Object which contains our implementation

Again, setting the work graphs implementation is requiring us to set up a couple of structures to tell the runtime exactly what we want to happen:

ComPtr<ID3D12WorkGraphProperties> workGraphProperties;
ThrowIfFailed(m_pipelineWorkGraph->QueryInterface(IID_PPV_ARGS(&workGraphProperties)));

ComPtr<ID3D12StateObjectProperties1> workGraphProperties1;
ThrowIfFailed(m_pipelineWorkGraph->QueryInterface(IID_PPV_ARGS(&workGraphProperties1)));

D3D12_WORK_GRAPH_MEMORY_REQUIREMENTS workGraphMemoryRequirements;
workGraphProperties->GetWorkGraphMemoryRequirements(0, &workGraphMemoryRequirements);

// Define the address and size of the backing memory
const D3D12_GPU_VIRTUAL_ADDRESS_RANGE backingMemoryAddrRange = {
    .StartAddress = m_memoryWorkGraph->GetGPUVirtualAddress(),
    .SizeInBytes = workGraphMemoryRequirements.MaxSizeInBytes,
};

// Backing memory only needs to be initialized once
const D3D12_SET_WORK_GRAPH_FLAGS setWorkGraphFlags =
m_memoryInitialized ? D3D12_SET_WORK_GRAPH_FLAG_NONE
                    : D3D12_SET_WORK_GRAPH_FLAG_INITIALIZE;
m_memoryInitialized = true;

const D3D12_SET_WORK_GRAPH_DESC workGraphProgramDesc = {
    .ProgramIdentifier = workGraphProperties1->GetProgramIdentifier(L"WorkGraphRasterization"),
    .Flags = setWorkGraphFlags,
    .BackingMemory = backingMemoryAddrRange
};

const D3D12_SET_PROGRAM_DESC programDesc = {
    .Type = D3D12_PROGRAM_TYPE_WORK_GRAPH,
    .WorkGraph = workGraphProgramDesc
};

// Set the program and backing memory
pCommandList->SetProgram(&programDesc);

// Submit all of our DrawRecords in one batch-submission
const D3D12_NODE_CPU_INPUT nodeCPUInput {
    .EntrypointIndex = 0,
    .NumRecords = UINT(m_Arguments.size()),
    .pRecords = (void*)m_Arguments.data(),
    .RecordStrideInBytes = sizeof(DrawRecord),
};

const D3D12_DISPATCH_GRAPH_DESC dispatchGraphDesc = {
    .Mode = D3D12_DISPATCH_MODE_NODE_CPU_INPUT,
    .NodeCPUInput = nodeCPUInput,
};

pCommandList->DispatchGraph(&dispatchGraphDesc);

This now launches as many work graph instances as we have objects in the scene. Every work graph instance goes through as many vertices as there are in each object in its root-node and continues through the graph until it arrives at the leaf nodes, where we write out our rasterized and interpolated values to a screen-sized UAV.

Results

As we’ve reached the end of the article, let’s take a look at the results and see if everything worked out as we anticipated. In the implementation found in our rasterizer repository we are providing you with a variety of controls so that you can explore the performance differences in real-time.

Sponza

This scene is from the glTFSample repository. It’s Crytek’s Sponza scene. It has some very large triangles and many smaller ones. Darker shades indicate smaller bounding boxes.

Times are given in 10ns ticks. The tested GPU is a 7900 XTX. The tested Driver is 23.20.11.01. For better visibility, there’s a graph plotted with the log10(ratio) against a monolithic (plain GPU SIMD) implementation.

bins monolithic ExIn multi-pass ExIn dynamic WoGr fixed WoGr
6 9598764 625700 5267600 2158076
7 9598764 898932 3964992 1842596
8 9598764 698976 3020956 1730860
9 9598764 776720 2466808 1661224
10 9598764 507876 2111640 1598632
11 9598764 467532 1896800 1575920
12 9598764 373704 1461440 395980
13 9598764 347604 1229584 368704
14 9598764 301444 863680 248180
15 9598764 320772 731676 310732

As you can see, performance differs significantly depending on the number of bins and thus the maximum size of rasterized bounding boxes. However, a sweet spot appears to be around 14 bins which is about 214 = 16k pixels, 128×128 among it. The work graph implementation has a moderate sized backing buffer of ~130MB. It is more than ~38x faster than the monolithic ExecuteIndirect implementation. The fixed Dispatch variation outperforms the dynamic Dispatch only variation by ~3x.

If we’d compare against a complicated multi-pass ExecuteIndirect implementation, which is much less flexible in terms of memory consumption and eats a large amount of memory bandwidth, we come out on top by ~1.2x.

Recommendations

  • If you can take advantage of sharing data between inputs: use a Coalescing node, otherwise use Thread.
  • Static parameters can be optimized, dynamic ones not so use fixed expansion where appropriate.
  • Reduce the dimensions of the dispatch to help the scheduler.
  • Start with as few outputs as possible to help the allocator. Allocation is expensive!
  • Measure if more expensive additional outputs may be compensated for by a better approach (see fixed vs. dynamic).

Availability

The sample code for this tutorial is available on GitHub: https://github.com/GPUOpen-LibrariesAndSDKs/WorkGraphComputeRasterizer

To allow you to experiment with a large variety of scenes, it’s based on the glTFSample already available on GitHub: https://github.com/GPUOpen-LibrariesAndSDKs/glTFSample

A fantastic repository for glTF scenes for you to check out is also on GitHub: https://github.com/KhronosGroup/glTF-Sample-Models

The source codes touch on in this tutorial can be found here:

  • DirectX® 12 setup/management code: libs/Cauldron/src/DX12/GLTF/GltfRasterizerPass.cpp
  • DirectX® 12 shaders: libs/Cauldron/src/DX12/shaders/Rasterizer*.hlsl

Disclaimers

Links to third-party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites, and no endorsement is implied. GD-98

Microsoft is a registered trademark of Microsoft Corporation in the US and/or other countries. Other product names used in this publication are for identification purposes only and may be trademarks of their respective owners.

DirectX® is a registered trademark of Microsoft Corporation in the US and/or other countries.

Read More

By |2023-10-15T08:08:35+00:00October 15, 2023|Life and Health|0 Comments

About the Author:

Leave A Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Go to Top