Recently the Vulkan API received an exciting new feature: video decoding that uses the built-in fixed function video unit found in many GPUs. This allows writing very fast cross-platform video applications while freeing up the CPU from expensive decoding tasks.

Take a look at an example using real time video decoding with Vulkan in the Wicked Engine game demo:

To give some examples, videos in the game scene are a pretty powerful tool that can provide interesting visuals for a modern cityscape, a virtual cinema, TV screens or monitors. Once it’s working, it is very simple to add a video to objects, as opposed to making flipbook texture atlases, which is a more old school way of doing things, and we also benefit from the video codec’s compression.

Why would you want to have a low level video API and not use an already available library? Because:

  • Take a look at FFMPEG, the most popular and feature complete video library out there: it’s huge, consisting of several DLL files and lots of dependencies.
  • No interop with other libraries is needed: you only use the usual Vulkan graphics objects, which are buffers for providing compressed data and textures for uncompressed video frames. You can easily reuse the decoded video as regular textures on your objects.
  • All your resources can stay fully on GPU, and all the heavy processing is performed on the GPU’s video unit.
  • You can fully utilize Vulkan synchronization and async queues to have a really low overhead. After all, the video unit is a separate part of the GPU and it makes a lot of sense to run video tasks asynchronously to your main rendering.

However, don’t expect a walk in the park if you want to add Vulkan video functionality into your program. If you know anything about Vulkan, you know that usually it is very complicated to do anything and video is no different. It requires deep understanding of the video format that you want to decode. The first two formats that are usable today (in non-beta version) are H264 and H265. So far I have tried the older and more popular H264, so I will detail the process specifically for this.

Note: all information here is probably inaccurate, it just shows what I learned while bringing up video decoding for the first time.

Inputs – MP4

The first thing you need is an H264 video. This most commonly comes from an MP4 file, which is a container format that can hold H264 data along with other metadata. You will need to “demux” the MP4 to get the H264 data from it. You can do so with the single header library called MiniMP4, or by rolling one yourself. To check how to use the MiniMP4 demuxer, take a look at their sample, or my example in Wicked Engine (look for the CreateVideo function).

Inputs – H264

From the H264 data, you will have to find Network Abstraction Layer units (NAL units) for the corresponding elements:

  • Picture Parameter Set (PPS)
  • Sequence Parameter Set (SPS)
  • Slice Header

The NAL units are separated by a start code of three (0,0,1) or four (0,0,0,1) bytes. That’s what you need to find, then you must read the NAL unit header to determine the type of data that follows. I have created a minimal H264 parser header file that only handles the units that can be passed to the Vulkan decoder. My header file is based on the H264 bitstream library, but greatly reduced, rewritten for easier C++ syntax and with allocations removed. You can use the h264 parser header file very easily by creating a Bitstream from a data pointer, then calling the read_nal_header(), read_pps(), read_sps() and read_slice_header() functions, which will create simple plain old data structures for you to use.
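The start code search described above can be sketched as follows. This is a minimal illustration and not the parser header mentioned above: the NalUnit struct and find_nal_units() function are hypothetical names, and the sketch assumes the whole stream is already in memory with start codes in place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type for this sketch.
struct NalUnit
{
	size_t offset = 0;         // index of the NAL header byte (first byte after the start code)
	size_t size = 0;           // bytes from the NAL header up to the next start code or end of data
	uint8_t nal_ref_idc = 0;   // 2 bits, nonzero when the unit is used as a reference
	uint8_t nal_unit_type = 0; // 5 bits: 1 = non-IDR slice, 5 = IDR slice, 7 = SPS, 8 = PPS
};

// Scans for (0,0,1) start codes. A four byte (0,0,0,1) start code is found
// the same way, because it is a three byte one with an extra leading zero.
inline std::vector<NalUnit> find_nal_units(const uint8_t* data, size_t size)
{
	assert(data != nullptr || size == 0);
	std::vector<NalUnit> units;

	// Ends the most recently found unit at 'end'. Zero bytes right before a
	// start code belong to the start code, so they are trimmed from the unit.
	auto close_last = [&](size_t end) {
		if (units.empty())
			return;
		NalUnit& last = units.back();
		while (end > last.offset && data[end - 1] == 0)
			--end;
		last.size = end - last.offset;
	};

	size_t i = 0;
	while (i + 3 < size) // a start code needs 3 bytes plus at least the header byte
	{
		if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
		{
			close_last(i);
			NalUnit unit;
			unit.offset = i + 3;
			unit.nal_ref_idc = (data[unit.offset] >> 5) & 0x3;
			unit.nal_unit_type = data[unit.offset] & 0x1F;
			units.push_back(unit);
			i += 3;
		}
		else
		{
			++i;
		}
	}
	close_last(size);
	return units;
}
```

Each returned unit can then be handed to the matching SPS, PPS or slice header reader depending on its nal_unit_type.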

To put it simply, the SPS and PPS structures are descriptions for the whole video, while the Slice Header is a description for a single Slice (basically a video frame). To complicate things, there can be multiple SPS and PPS, and a Slice Header can index them, telling which one to use. To complicate things even further, there can be multiple Slice Headers (and Slices) for one video frame, for example when the video is interlaced. Most videos I have seen are the simple case, with 1 PPS, 1 SPS and 1 Slice Header per frame, so at first you can implement this easy way and handle more complicated situations later.
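The indexing mentioned above works in two steps: the slice header names a PPS by pic_parameter_set_id, and that PPS names an SPS by seq_parameter_set_id. Below is a minimal sketch of such a lookup. The structs are hypothetical, heavily reduced stand-ins for the parsed structures (only the id fields matter here), and ParameterSets is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical reduced structures for this sketch.
struct SPS { uint32_t seq_parameter_set_id = 0; uint32_t num_ref_frames = 0; };
struct PPS { uint32_t pic_parameter_set_id = 0; uint32_t seq_parameter_set_id = 0; };
struct SliceHeader { uint32_t pic_parameter_set_id = 0; };

struct ParameterSets
{
	std::map<uint32_t, SPS> sps_by_id;
	std::map<uint32_t, PPS> pps_by_id;

	// A later parameter set with the same id replaces the earlier one.
	void add(const SPS& sps)
	{
		assert(sps.seq_parameter_set_id < 32); // H264 allows SPS ids 0..31
		sps_by_id[sps.seq_parameter_set_id] = sps;
	}
	void add(const PPS& pps)
	{
		assert(pps.pic_parameter_set_id < 256); // H264 allows PPS ids 0..255
		pps_by_id[pps.pic_parameter_set_id] = pps;
	}

	// Follows slice header -> PPS -> SPS. Returns false if either is missing.
	bool resolve(const SliceHeader& slice, const PPS*& pps, const SPS*& sps) const
	{
		auto pps_it = pps_by_id.find(slice.pic_parameter_set_id);
		if (pps_it == pps_by_id.end())
			return false;
		auto sps_it = sps_by_id.find(pps_it->second.seq_parameter_set_id);
		if (sps_it == sps_by_id.end())
			return false;
		pps = &pps_it->second;
		sps = &sps_it->second;
		return true;
	}
};
```

In the simple case with one SPS and one PPS, both maps hold a single entry and resolve() always returns the same pair.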

Inputs – Vulkan

The general idea is that you will put the compressed H264 bitstream in a GPU buffer (VkBuffer object). It is up to you if you put all the frames into one buffer, or upload frames on demand. If you put multiple frames into a buffer, you need to make sure that you put each slice data at a correctly aligned offset. You can find out the offset alignment required from VkVideoCapabilitiesKHR::minBitstreamBufferOffsetAlignment. You must also use an aligned size, that you find from VkVideoCapabilitiesKHR::minBitstreamBufferSizeAlignment. You can get the video capabilities structure by using vkGetPhysicalDeviceVideoCapabilitiesKHR() function if the video extensions are supported by the GPU.
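To illustrate the alignment rules, here is a minimal sketch that computes where each frame’s slice data would go when all frames share one buffer. The function and struct names are hypothetical, and the alignment values in the usage note are made up; in a real program they would come from the VkVideoCapabilitiesKHR members named above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rounds value up to the next multiple of alignment.
inline uint64_t align_up(uint64_t value, uint64_t alignment)
{
	assert(alignment > 0);
	return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical bookkeeping for one frame's data inside the shared buffer.
struct FrameRange
{
	uint64_t offset = 0; // aligned start of the frame's data
	uint64_t size = 0;   // aligned size of the frame's data
};

struct BitstreamLayout
{
	std::vector<FrameRange> frames;
	uint64_t total_size = 0; // minimum buffer size that fits all frames
};

// offset_alignment stands for minBitstreamBufferOffsetAlignment,
// size_alignment stands for minBitstreamBufferSizeAlignment.
inline BitstreamLayout layout_bitstream(
	const std::vector<uint64_t>& frame_sizes,
	uint64_t offset_alignment,
	uint64_t size_alignment)
{
	BitstreamLayout layout;
	uint64_t cursor = 0;
	for (uint64_t frame_size : frame_sizes)
	{
		FrameRange range;
		range.offset = align_up(cursor, offset_alignment);
		range.size = align_up(frame_size, size_alignment);
		cursor = range.offset + range.size;
		layout.frames.push_back(range);
	}
	layout.total_size = cursor;
	return layout;
}
```

For example, with frame sizes of 100, 300 and 5 bytes, an offset alignment of 256 and a size alignment of 64, the frames land at offsets 0, 256 and 768 with sizes 128, 320 and 64, for a total of 832 bytes.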

The bitstream buffer will need to be marked with VK_BUFFER_USAGE_VIDEO_DECODE_SRC_BIT_KHR to be usable in video decoding and the creation parameters structure must include a VkVideoProfileListInfoKHR structure in its pNext chain. The profile list must be filled with the video profile information that the buffer will be used with.

One thing that was not evident to me is exactly what data you have to provide in the VkBuffer. After a lot of trying and inspecting the memory of the Nvidia video sample’s buffers, I determined that you have to provide the 0,0,1 NAL start code bytes, followed by the NAL unit header, followed by the Slice Header and Slice Data. You can throw out the PPS and SPS data, though if they remain in the buffer, the video decoder will consume them without issue. Very importantly, make sure that each unit is separated correctly with the NAL start code (0,0,1) or (0,0,0,1) byte sequence, otherwise you will most likely get nothing: no result and no errors.

Another thing is that I was unable to get any results with normal buffers that are in default GPU memory. I had to use host mapped buffers (“upload buffers” in DX12 terminology). These are buffers that are available for CPU writes and GPU reads. The Vulkan validation layer didn’t give any reasons why.

Other than the bitstream buffers, Vulkan requires you to fill a lot of description structures:

  • From the PPS, you will need to fill Vulkan’s StdVideoH264PictureParameterSet and StdVideoH264ScalingLists structures.
  • From the SPS, you will need to fill the StdVideoH264SequenceParameterSet, StdVideoH264SequenceParameterSetVui and StdVideoH264HrdParameters structures.

You will later pass those structures to VkVideoDecodeH264SessionParametersAddInfoKHR, which is passed to VkVideoDecodeH264SessionParametersCreateInfoKHR when creating the video session parameters object. If you don’t want to provide this at creation time but want to update the SPS and PPS later, look for VkVideoSessionParametersUpdateInfoKHR and vkUpdateVideoSessionParametersKHR().

Video session

The video session is the video specific main Vulkan object that you need to create. It will require you to query memory requirements with vkGetVideoSessionMemoryRequirementsKHR and then allocate and bind memory appropriately with vkBindVideoSessionMemoryKHR(). Then you will use the VkVideoDecodeH264SessionParametersCreateInfoKHR and VkVideoSessionParametersCreateInfoKHR structures when calling vkCreateVideoSessionParametersKHR().

The video session is used for internal memory allocations for video coding. When decoding video, you will always need to call:

  • vkCmdBeginVideoCodingKHR()
  • vkCmdDecodeVideoKHR()
  • vkCmdEndVideoCodingKHR()

vkCmdDecodeVideoKHR() needs to be between the begin/end video coding commands. Before the first use of a video session, you will also have to call vkCmdControlVideoCodingKHR() with the VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR flag (also between the begin/end video coding commands).

But you also need to specify the decoding outputs for the video coding commands first.

Outputs – H264

The H264 video codec is based on the concept of Intra and Predictive frames (and variations on them). Intra frames are full video frames that can be decompressed by themselves, for example the first frame of the video will most likely be an Intra (I) frame. Predictive (P) frames can only be decompressed by referencing other frames, because their data only contains differences to other frames (this means their compressed data is also much smaller). Note, that P-frames can reference not only one, but many other frames. Also, reference frames are not necessarily I-frames, so don’t make those assumptions like I did.

To make P-frames possible, we need to keep a buffer of history frames. The SPS structure’s num_ref_frames tells us how many reference frames need to be kept. However, the currently decoded frame cannot be used as reference, so we must have a buffer of num_ref_frames + 1 elements. Note that H264 specifies the maximum number of reference frames as 16, so num_ref_frames can never be larger than this. You can use this knowledge to allocate simple temporary arrays on the stack for example, since you know there can never be more than 17 DPB slots (16 reference frames + 1 current frame). The frames are actually textures, so we can use a texture array with array_size = num_ref_frames + 1. The texture will need to be a YUV420 texture, and Vulkan provides VK_FORMAT_G8_B8R8_2PLANE_420_UNORM for this purpose.

This array texture is called the Decoded Picture Buffer (DPB for short). We must also use the following flags when creating it: VK_IMAGE_USAGE_VIDEO_DECODE_DPB_BIT_KHR, VK_IMAGE_USAGE_VIDEO_DECODE_SRC_BIT_KHR, VK_IMAGE_USAGE_VIDEO_DECODE_DST_BIT_KHR. Depending on video capabilities, it is also possible to use individual textures instead of a texture array, and that way we perhaps wouldn’t need all the flags on all of them. Some hardware requires the DPB images to be layers of a single texture array though, so you will need to check for that too. As far as I can tell, VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR in VkVideoCapabilitiesKHR::flags reports whether separate images are allowed, while VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR in VkVideoDecodeCapabilitiesKHR reports whether a DPB slot image can also serve as the decode output.

For every decode operation, we must inform Vulkan about our current DPB state, so which slice we use as current decoded result/destination and which slices will be used as reference images. Managing the DPB state is up to the application, and telling Vulkan is done through the following chained structures:

  • VkVideoReferenceSlotInfoKHR
  • VkVideoPictureResourceInfoKHR
  • VkVideoDecodeH264DpbSlotInfoKHR
  • StdVideoDecodeH264ReferenceInfo

What all these structures basically tell Vulkan is the VkImageView of the DPB texture, the array slice, offsets/width/height and some H264 parameters. The StdVideoDecodeH264ReferenceInfo contains the H264 parameters for the DPB slots, most importantly:

  • flags.bottom_field_flag and flags.top_field_flag: both should be set to 1 for progressive (non-interlaced) video frames. This is important and easy to miss, and one of the main reasons I have seen video corruption.
  • FrameNum: this comes from the slice header’s frame_num value.
  • PicOrderCnt: this is an array of two integers; for a progressive frame you must set both to the same value. Calculating this value is a bit complicated, I will describe it below.

PicOrderCnt, or Picture Order Count, tells us the frame display order, because in H264 the frame data is stored in decoding order, not display order. Each slice header contains the pic_order_cnt_lsb value, which must be added to pic_order_cnt_msb to compute PicOrderCnt. The pic_order_cnt_msb is not stored in the bitstream though; it must be calculated based on SPS::log2_max_pic_order_cnt_lsb_minus4 and SPS::pic_order_cnt_type.

Here is my code example for this, which works okayish for me but may or may not have some issues (I’ve seen some frame pacing stutter in some videos that I couldn’t figure out yet):

// Tracking values declared before iterating slices:
int prev_pic_order_cnt_lsb = 0;
int prev_pic_order_cnt_msb = 0;
int poc_cycle = 0;

// For each slice, do the following:

// Rec. ITU-T H.264 (08/2021) page 77
int max_pic_order_cnt_lsb = 1 << (sps.log2_max_pic_order_cnt_lsb_minus4 + 4);
int pic_order_cnt_lsb = slice_header->pic_order_cnt_lsb;

if (pic_order_cnt_lsb == 0)
{
	poc_cycle++;
}

// Rec. ITU-T H.264 (08/2021) page 115
// Also: https://www.ramugedia.com/negative-pocs
int pic_order_cnt_msb = 0;
if (pic_order_cnt_lsb < prev_pic_order_cnt_lsb && (prev_pic_order_cnt_lsb - pic_order_cnt_lsb) >= max_pic_order_cnt_lsb / 2)
{
	pic_order_cnt_msb = prev_pic_order_cnt_msb + max_pic_order_cnt_lsb; // pic_order_cnt_lsb wrapped around
}
else if (pic_order_cnt_lsb > prev_pic_order_cnt_lsb && (pic_order_cnt_lsb - prev_pic_order_cnt_lsb) > max_pic_order_cnt_lsb / 2)
{
	pic_order_cnt_msb = prev_pic_order_cnt_msb - max_pic_order_cnt_lsb; // here negative POC might occur
}
else
{
	pic_order_cnt_msb = prev_pic_order_cnt_msb;
}
//pic_order_cnt_msb = pic_order_cnt_msb % 256;
prev_pic_order_cnt_lsb = pic_order_cnt_lsb;
prev_pic_order_cnt_msb = pic_order_cnt_msb;

// final value of PicOrderCnt:
PicOrderCnt = pic_order_cnt_msb + pic_order_cnt_lsb;

To learn more about different picture order count types, read this: H264 picture management.
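To see the wraparound handling in action, the steps above can be wrapped in a small struct and fed a sequence of pic_order_cnt_lsb values. This is a sketch under a few assumptions: PocTracker is a hypothetical name, only pic_order_cnt_type 0 is covered, and the previous values are updated after every slice as in the snippet above.

```cpp
#include <cassert>

// Hypothetical wrapper around the msb/lsb tracking for pic_order_cnt_type 0.
struct PocTracker
{
	int prev_pic_order_cnt_lsb = 0;
	int prev_pic_order_cnt_msb = 0;

	// Returns PicOrderCnt for the next slice in decoding order.
	int next(int pic_order_cnt_lsb, int log2_max_pic_order_cnt_lsb_minus4)
	{
		// The H264 specification limits this SPS field to the range 0..12.
		assert(log2_max_pic_order_cnt_lsb_minus4 >= 0 && log2_max_pic_order_cnt_lsb_minus4 <= 12);
		const int max_pic_order_cnt_lsb = 1 << (log2_max_pic_order_cnt_lsb_minus4 + 4);
		const int half = max_pic_order_cnt_lsb / 2;

		int pic_order_cnt_msb = prev_pic_order_cnt_msb;
		if (pic_order_cnt_lsb < prev_pic_order_cnt_lsb && (prev_pic_order_cnt_lsb - pic_order_cnt_lsb) >= half)
		{
			pic_order_cnt_msb += max_pic_order_cnt_lsb; // lsb wrapped around forwards
		}
		else if (pic_order_cnt_lsb > prev_pic_order_cnt_lsb && (pic_order_cnt_lsb - prev_pic_order_cnt_lsb) > half)
		{
			pic_order_cnt_msb -= max_pic_order_cnt_lsb; // lsb wrapped backwards, POC may go negative
		}

		prev_pic_order_cnt_lsb = pic_order_cnt_lsb;
		prev_pic_order_cnt_msb = pic_order_cnt_msb;
		return pic_order_cnt_msb + pic_order_cnt_lsb;
	}
};
```

With log2_max_pic_order_cnt_lsb_minus4 = 0 the lsb counts modulo 16, so the lsb sequence 0, 4, 8, 12, 0, 4 produces the PicOrderCnt values 0, 4, 8, 12, 16, 20. Starting fresh, the lsb sequence 0, 14, 2 produces 0, -2, 2, which is the negative POC case.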

You will need the PicOrderCnt both to provide the data to Vulkan and to determine the correct display order of frames. You basically need to sort the frames by increasing value of PicOrderCnt (POC from now on), but watch out, because the value will occasionally wrap around and reset to zero. You will need to handle that by grouping the frames by POC wrapping cycle. When I detect that the POC wrapped, I assign an increased GOP value (stands for Group Of Pictures) to every frame besides the POC value. Then when sorting, I create a priority value for every frame from the GOP and POC values, by putting the GOP value into the high 32 bits of a 64-bit integer, and the POC value into the low 32 bits. For example, if I have a POC sequence like this, this is how the GOP value will be tracked:

  • frame 1: POC 0, GOP 0 <== initialize GOP to 0
  • frame 2: POC 3, GOP 0
  • frame 3: POC 1, GOP 0
  • frame 4: POC 2, GOP 0
  • frame 5: POC 0, GOP 1 <== POC wrapped to 0, increase GOP
  • frame 6: POC 2, GOP 1
  • frame 7: POC 1, GOP 1
  • frame 8: POC 0, GOP 2 <== POC wrapped to 0, increase GOP
  • frame 9: POC 1, GOP 2

Then sorting can be done like this for example:

std::vector<size_t> frame_display_order(video->frames_infos.size());
for (size_t i = 0; i < video->frames_infos.size(); ++i)
{
	frame_display_order[i] = i;
}
std::sort(frame_display_order.begin(), frame_display_order.end(), [&](size_t a, size_t b) {
	const Video::FrameInfo& frameA = video->frames_infos[a];
	const Video::FrameInfo& frameB = video->frames_infos[b];
	int64_t prioA = (int64_t(frameA.gop) << 32ll) | int64_t(frameA.poc);
	int64_t prioB = (int64_t(frameB.gop) << 32ll) | int64_t(frameB.poc);
	return prioA < prioB;
});

After this, the frame_display_order array is sorted by display order priority, and each element contains an index into the original video frames array (video->frames_infos). The display order for the above sequence looks like this:

  • frame 1
  • frame 3
  • frame 4
  • frame 2
  • frame 5
  • frame 7
  • frame 6
  • frame 8
  • frame 9

Now the trick is that you want to decode the frames in original decoding order, but only display a new frame when the next required display order value is reached, which is always increasing by one when a new frame is displayed. For this, I copy back the display order values to the original video frame infos:

for (size_t i = 0; i < frame_display_order.size(); ++i)
{
	video->frames_infos[frame_display_order[i]].display_order = (int)i;
}
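Putting the GOP tracking, sorting and copy-back steps together, here is a self-contained sketch that reproduces the nine frame example above. FrameInfo is a hypothetical reduced struct and the function names are illustrative. It assumes, as in the example list, that the GOP increases whenever the POC returns to 0 after the first frame. The priority is built with an addition instead of a bitwise OR, which is a deliberate change so that a negative POC would still sort correctly within its GOP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical reduced per-frame info for this sketch.
struct FrameInfo
{
	int poc = 0;           // picture order count
	int gop = 0;           // POC wrapping cycle
	int display_order = 0; // filled in by compute_display_order()
};

// Assigns the GOP value: it increases each time POC returns to 0.
inline void assign_gop(std::vector<FrameInfo>& frames)
{
	int gop = 0;
	for (size_t i = 0; i < frames.size(); ++i)
	{
		if (i > 0 && frames[i].poc == 0)
			++gop;
		frames[i].gop = gop;
	}
}

// GOP goes to the high 32 bits, POC is added as a signed value.
inline int64_t display_priority(const FrameInfo& frame)
{
	assert(frame.gop >= 0);
	return int64_t(frame.gop) * (int64_t(1) << 32) + int64_t(frame.poc);
}

// Returns frame indices sorted by display order and writes each frame's
// display position back into the frame infos.
inline std::vector<size_t> compute_display_order(std::vector<FrameInfo>& frames)
{
	std::vector<size_t> order(frames.size());
	std::iota(order.begin(), order.end(), size_t(0));
	std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
		return display_priority(frames[a]) < display_priority(frames[b]);
	});
	for (size_t i = 0; i < order.size(); ++i)
	{
		frames[order[i]].display_order = int(i);
	}
	return order;
}
```

For the POC sequence 0, 3, 1, 2, 0, 2, 1, 0, 1 this yields the zero based index order 0, 2, 3, 1, 4, 6, 5, 7, 8, which matches the frame list above (frame 1, 3, 4, 2, 5, 7, 6, 8, 9).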

The other trick is that your DPB (Decoded Picture Buffer array texture) state management can be decoupled from the display ordering to simplify it (at least that’s how I did it). Each time a new frame is decoded, it will be resolved to a new RGB texture, so I keep another buffer of textures independently from the DPB. Even though the newly decoded frame is resolved into an RGB texture (I show an example later), it is not displayed yet if it’s not the next displayable one, but kept around until needed. The good thing is that the DPB slot can be reused immediately after this for the next decodable frame (or kept around if it’s needed as a reference).

The simple DPB management scheme I am using goes like this:

  • You start with a DPB array texture with array_size = sps.num_ref_frames + 1
  • Have a FIFO queue to keep track of reference frames, maximum size is DPB array_size – 1
  • Track the current DPB slot that’s going to contain decode result, it’s an index that starts from zero
  • If current frame is an IDR-frame (Intra frame and also a reference), clear the reference frame queue
  • Decode frame into current DPB slot, while you provide other slot indices to Vulkan that were used as reference
  • Resolve the current decode result into an RGB texture that will be used for display
  • If current frame is also a reference frame (if NALHeader::idc > 0) then add this slot as active reference to the queue. If queue is full, remove oldest element (FIFO). Also increase the current slot index because next frame mustn’t overwrite the current slot as it will be used as reference. The next slot index must be wrapped in the range [0, DPB.array_size)
  • If current frame is not a reference frame, then the current slot can be reused in the next decode operation, there is nothing else to do

As you can see, the scheme above is a simple FIFO queue that tracks reference frames plus one output decode frame. However, the H264 specification also has a “long term reference frame” flag, which this doesn’t handle. I haven’t yet had a video which uses long term reference frames, so I haven’t tried that.
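The FIFO scheme above can be sketched as a small class. DpbTracker and its method names are hypothetical, the sketch only does the slot bookkeeping (no Vulkan calls), and like the scheme itself it ignores long term reference frames.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical slot bookkeeping for the FIFO scheme; no long term references.
class DpbTracker
{
public:
	struct DecodeSetup
	{
		uint32_t target_slot;                  // DPB slot that receives the decode result
		std::vector<uint32_t> reference_slots; // DPB slots currently active as references
	};

	explicit DpbTracker(uint32_t num_ref_frames) : slot_count_(num_ref_frames + 1)
	{
		assert(num_ref_frames <= 16); // H264 allows at most 16 reference frames
	}

	// Call before decoding a frame. An IDR frame clears the reference queue.
	DecodeSetup begin_frame(bool is_idr)
	{
		if (is_idr)
			references_.clear();
		return { current_slot_, std::vector<uint32_t>(references_.begin(), references_.end()) };
	}

	// Call after decoding a frame. is_reference corresponds to nal_ref_idc > 0.
	void end_frame(bool is_reference)
	{
		if (!is_reference)
			return; // the current slot can be reused by the next decode

		references_.push_back(current_slot_);
		if (references_.size() > slot_count_ - 1)
			references_.pop_front(); // drop the oldest reference (FIFO)

		// The next frame must not overwrite the slot that just became a reference.
		current_slot_ = (current_slot_ + 1) % slot_count_;
	}

private:
	uint32_t slot_count_;
	uint32_t current_slot_ = 0;
	std::deque<uint32_t> references_;
};
```

Because slots are handed out cyclically and the queue holds at most slot_count - 1 entries, the slot that the index wraps onto is always the one that was just evicted from the queue (or one that was never used), so the decode target is never an active reference.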

On the Vulkan side, the VkVideoBeginCodingInfoKHR requires you to state all the DPB slots you are going to use in the pReferenceSlots and referenceSlotCount parameters. Don’t confuse this with the DPB references, because these refer to the array of VkVideoReferenceSlotInfoKHR parameters, which will not necessarily all describe a reference image: one of them will also describe the current DPB decode slot. For the current decode slot, you must specify VkVideoReferenceSlotInfoKHR::slotIndex = -1 to tell Vulkan that this slot is not used as reference, but will be activated now and used in video coding operations until vkCmdEndVideoCodingKHR is called.

In contrast to the VkVideoBeginCodingInfoKHR, in the VkVideoDecodeInfoKHR’s pReferenceSlots and referenceSlotCount you must only include the slots that are used as reference for the decode operation, not the currently written DPB slot that you activate in the VkVideoBeginCodingInfoKHR. Thankfully, messing up here will trigger validation layer messages, which can help with the otherwise confusing naming.

Resolving to RGB

First, in Vulkan there is an optional hardware feature called sampler YCbCr conversion, which lets you skip this step and sample from YUV textures directly. My goal was also to support video decoding in DX12, so I decided not to use this, because it is not available there. In this case, you will need to resolve the YUV textures into RGB textures with a shader. As a sidenote, I also generate a mipmap chain for video frames, and that is sure to be supported for regular RGB textures.

The aforementioned YUV texture format VK_FORMAT_G8_B8R8_2PLANE_420_UNORM is a multi planar format, meaning it has the luminance part stored separately in memory from the chrominance part.

  • The luminance part is a full resolution 8 bit per pixel image and can be accessed with VK_IMAGE_ASPECT_PLANE_0_BIT. It can be viewed as a shader resource with the VK_FORMAT_R8_UNORM or VK_FORMAT_R8_UINT formats.
  • The chrominance part is an 8+8 bit per pixel image, but its resolution is half of the luminance in both dimensions, and it can be accessed with VK_IMAGE_ASPECT_PLANE_1_BIT. It can be viewed as a shader resource with the VK_FORMAT_R8G8_UNORM or VK_FORMAT_R8G8_UINT formats.

I recommend creating the shader resource views with UNORM formats to gain the ability to use the texture sampling operation. This is especially helpful for the chrominance image, which will be bilinearly sampled because its resolution is lower than that of the luminance.

This means that in Vulkan, you will need to create two separate image views for one YUV texture, using two different subresource aspects, and bind those to a shader. Here is a shader snippet that shows you how to declare, sample and convert the luminance and chrominance images to RGB (in HLSL):

// Declarations before shader:
Texture2DArray<float> input_luminance : register(t0);
Texture2DArray<float2> input_chrominance : register(t1);
RWTexture2D<float4> output : register(u0); // for example VK_FORMAT_R8G8B8A8_UNORM

// Sampling in shader:
float luminance = input_luminance.SampleLevel(sampler_linear_clamp, uv, 0);
float2 chrominance = input_chrominance.SampleLevel(sampler_linear_clamp, uv, 0);

// Converting:
float C = luminance - 16.0 / 255.0;
float D = chrominance.x - 0.5;
float E = chrominance.y - 0.5;

float r = saturate(1.164383 * C + 1.596027 * E);
float g = saturate(1.164383 * C - (0.391762 * D) - (0.812968 * E));
float b = saturate(1.164383 * C + 2.017232 * D);

// Output to RGB texture:
output[DTid.xy] = float4(r, g, b, 1);

You can find a reference guide to YUV texture conversions on Microsoft MSDN.
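For checking the shader output against known values, the same conversion can be written as a CPU reference in C++. This is a sketch that reuses the coefficients from the shader above, which appear to be the usual limited range BT.601 ones; the Rgb struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct Rgb
{
	float r, g, b;
};

// Same as the HLSL saturate(): clamps to the [0, 1] range.
inline float saturate(float x)
{
	return std::min(std::max(x, 0.0f), 1.0f);
}

// Converts one 8-bit YUV sample to normalized RGB with the shader's math.
inline Rgb yuv_to_rgb(int y, int u, int v)
{
	assert(y >= 0 && y <= 255);
	assert(u >= 0 && u <= 255);
	assert(v >= 0 && v <= 255);

	// UNORM views return the byte value divided by 255.
	const float luminance = y / 255.0f;
	const float chrominance_x = u / 255.0f;
	const float chrominance_y = v / 255.0f;

	const float C = luminance - 16.0f / 255.0f;
	const float D = chrominance_x - 0.5f;
	const float E = chrominance_y - 0.5f;

	Rgb result;
	result.r = saturate(1.164383f * C + 1.596027f * E);
	result.g = saturate(1.164383f * C - 0.391762f * D - 0.812968f * E);
	result.b = saturate(1.164383f * C + 2.017232f * D);
	return result;
}
```

In limited range video, Y = 16 with U = V = 128 is black and Y = 235 with U = V = 128 is white, so these two inputs should come out very close to (0, 0, 0) and (1, 1, 1). If white looks grey or black looks lifted in your output, the range offset or scale is the first thing to check.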

Performance

I had difficulties getting accurate performance readings, because timestamp queries do not seem to work on the video decode queue currently (I’m using an Nvidia RTX 2060 laptop GPU). The Nvidia Nsight GPU profiler also doesn’t display video decoding information, unfortunately. There is a way to time the video queue however, by issuing timestamps in a queue that executes before the video queue and in one that executes after it, with the queues synced with each other. With this method, I see that it takes about 1.8 milliseconds to decode a Full HD (1920 * 1080) H264 video frame, so I can decode 9 Full HD videos while maintaining 60 FPS. For a 4K video (3840 * 2160) it takes 5.2 milliseconds to decode a frame, so it would be possible to run 3 4K videos at once while maintaining 60 FPS on this GPU.

TLDR:

  • Nvidia RTX 2060 (laptop) GPU
  • 1080p (Full HD) video decode: 1.8 ms / frame
  • 2160p (4K) video decode: 5.2 ms / frame
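The stream counts above follow from dividing the frame time budget by the decode time. A minimal sketch of that arithmetic, assuming the decodes run back to back on a single video queue and every video needs one new frame per displayed frame (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// How many videos can decode one frame each within a single display frame.
inline int max_concurrent_streams(double decode_ms_per_frame, double target_fps)
{
	assert(decode_ms_per_frame > 0.0 && target_fps > 0.0);
	const double frame_budget_ms = 1000.0 / target_fps; // about 16.67 ms at 60 FPS
	return static_cast<int>(std::floor(frame_budget_ms / decode_ms_per_frame));
}
```

With the measurements above, 16.67 / 1.8 is about 9.26 and 16.67 / 5.2 is about 3.2, which gives 9 Full HD or 3 4K videos at 60 FPS. A video with a lower frame rate than the display would not need a decode every frame, so the real limit can be higher.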

And remember that you can run the video decoding tasks asynchronously with a lot of other GPU work. For example, I am running video decoding at the beginning of the frame while things like the depth prepass are rendering on the graphics queue, particle and BVH updates are running on the compute queue and virtual texture copies are running on the transfer queue. That means much of the video decoding is actually free, until you need one of the queues to wait for decoding results to complete.

DirectX 12

I was planning to implement decoding in DX12 too, which is similarly low level to Vulkan: it also needs the user to allocate and track DPB resources, and to provide similar structures describing the SPS, PPS and slice headers. Although it uses a much lower number of structures, they are more confusing, as they require understanding yet another specification, the DXVA for H264, because the DecodeFrame command requires you to provide DXVA structures that describe the video frame. Aside from reading the linked DXVA spec, you can also take a look at the Mesa open source library, specifically this file that fills the DXVA structures. Sadly, I was unable to get the DX12 decoder up and running. Currently I get a completely broken result on Intel and unavoidable crashing on Nvidia, even though I think all parameters are specified correctly.

Closing

After spending about a month bringing video decoding to Wicked Engine in Vulkan and DX12, I learned a lot and I am glad that at least the Vulkan implementation was successful. I wanted to give up somewhere halfway because it was turning out to be a much bigger task than I anticipated. While my implementation has some issues for now, like frame stuttering in some videos, it is already very useful as I can place videos really easily into the game world. You can look at my code in Wicked Engine:

  • wiVideo: implementation for MP4 video file loading, demuxing, parsing SPS, PPS, SliceHeader, DPB management
  • GraphicsDevice_Vulkan: Vulkan implementations, especially the CreateVideoDecoder() and VideoDecode() functions
  • GraphicsDevice_DX12: using the same interface as Vulkan, but DX12 implementation
  • h264.h: minimal h264 parser just to pass video information to decoder

Video resources by others:

Happy video decoding!
