BitKnit is an LZ77+rANS lossless general-purpose compressor that was designed 8 years ago (1.0 release on May 20, 2015) to replace the aging and very slow-to-encode LZ77+arithmetic coding codec built into Granny 3D, and also worked – somewhat unintentionally – as an LZ77/entropy coding test vehicle. The codec did end up in Granny, and later a small variation (with a slightly different bitstream to work within the different container) shipped in Oodle 2.1.2. In Oodle it was quickly superseded by the “oceanic cryptozoology” codecs (first Kraken, then later Mermaid/Selkie and Leviathan), so BitKnit the actual codec is just a historical curiosity these days (and deprecated in current Oodle versions), but a lot of its ideas ended up in other Oodle codecs and might be of general interest. So I’ll keep it brief and focus mainly on the parts that were successfully used elsewhere.

Basic overview

BitKnit (I’ll just write BK from here on out) is basically your usual LZ77 with an entropy coding backend. Like LZX and then LZMA, it keeps a history of recent match offsets that can be cheaply reused; BK was designed mainly for Granny files, which are chock-full of fixed-size records with highly structured offsets, so unlike LZMA’s LRU history of 3 recent match offsets, it keeps 7. This is somewhat expensive to maintain in the encoder/decoder and mostly a waste for ASCII text and text-based formats, but turns out to be very useful for the structured binary data that was its intended use case.

For entropy coding, it uses a 2-way interleaved rANS coder with 32-bit state that emits 16-bit words at a time, and 15-bit “semi-adaptive” (aka deferred summation) multi-symbol (i.e. not binary) models. “Semi-adaptive” means that they change over time, but not after every symbol; model updates are batched and the model remains static in between, which lets it amortize update cost and build some auxiliary data structures to accelerate decoding. The particular model it ended up with was a compromise to hit its speed targets, and its update mechanism wasn’t great. It was significantly faster than LZMA/CABAC/etc.-style “update after every symbol” fully adaptive models, but with appreciably worse compression, and slower to decode than a fully static model would’ve been (which would’ve also enabled tANS usage). Successor codecs jettisoned this part of the design: Oodle LZNA (as the name suggests, an LZMA-derived design) used multi-symbol rANS, but with full adaptation after every symbol; Kraken, Mermaid and Leviathan (but not Selkie, which is a byte-aligned LZ with no entropy coding) support multiple choices of entropy coders, but they’re all based on static models.
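To make the batching concrete, here’s a minimal sketch of a deferred-summation model in Python. The class name, the update interval, and the “copy the counts over” mixing rule are all illustrative; BitKnit’s actual model update was more involved than this.

```python
# Sketch of a "semi-adaptive" (deferred summation) frequency model.
# Between batched updates, the frequencies the coder uses are frozen,
# so lookup tables can be built once per batch instead of per symbol.
class DeferredModel:
    def __init__(self, nsyms, update_interval=32):
        self.counts = [1] * nsyms   # symbols seen, accumulated since last update
        self.freqs = [1] * nsyms    # frequencies the coder actually codes with
        self.total = nsyms
        self.interval = update_interval
        self.until_update = update_interval

    def encode_freq(self, sym):
        # Static between updates; this is what makes the scheme fast.
        return self.freqs[sym], self.total

    def observe(self, sym):
        self.counts[sym] += 1
        self.until_update -= 1
        if self.until_update == 0:
            # Batched update: fold the accumulated counts into the model.
            self.freqs = self.counts[:]
            self.total = sum(self.freqs)
            self.until_update = self.interval
```

A real coder would also rescale `total` back to a fixed power of two (15 bits in BK’s case) at each update; that step is omitted here for brevity.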

BitKnit’s use of a semi-adaptive model makes it easy to do the modelling and most of the encoding in a single front-to-back pass. Static models are trickier to encode for because there are interdependencies between e.g. the LZ parse and the statistics of the symbols emitted, which makes encoding harder and somewhat slower, but such is life. The actual rANS bitstream needs to be encoded in reverse though (we prefer our decoders to consume data in increasing address order). To limit encoder memory usage, BitKnit processes data in chunks of 64k (uncompressed) at a time; the rANS bitstream is flushed after every chunk, which adds a small amount of overhead but lets us use a guaranteed bounded amount of memory in the encoder for arbitrary input sizes, improves its memory access patterns, and is also handy from a file format perspective. (Old post about rANS in practice, derived from how we used it in BK then LZNA, here). All the Oodle formats are also internally chunked (it’s how the framing format works), and I can recommend this strategy in general.
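As an illustration of the reverse-encoded bitstream, here’s a minimal single-state rANS sketch in Python with a 32-bit state, 16-bit word renormalization and a static 15-bit model. BitKnit interleaves two such states and flushes per 64k chunk; this sketch keeps one state and a linear symbol search for clarity, and all names are mine, not BitKnit’s.

```python
SCALE_BITS = 15          # 15-bit probability total, as in BitKnit's models
RANS_L = 1 << 16         # lower bound of the normalized state interval

def rans_encode(symbols, freqs, cum):
    total = 1 << SCALE_BITS
    x = RANS_L
    out = []                         # 16-bit words
    for s in reversed(symbols):      # rANS encodes back to front
        f = freqs[s]
        x_max = ((RANS_L >> SCALE_BITS) << 16) * f
        while x >= x_max:            # renormalize: push low 16 bits out
            out.append(x & 0xFFFF)
            x >>= 16
        x = (x // f) * total + (x % f) + cum[s]
    out.append(x & 0xFFFF)           # flush final 32-bit state
    out.append(x >> 16)
    out.reverse()                    # decoder reads in increasing address order
    return out

def rans_decode(words, freqs, cum, count):
    total = 1 << SCALE_BITS
    x = (words[0] << 16) | words[1]  # reload flushed state
    pos = 2
    syms = []
    for _ in range(count):
        slot = x & (total - 1)
        # Linear search for the symbol; real decoders use a lookup table.
        s = next(i for i in range(len(freqs))
                 if cum[i] <= slot < cum[i] + freqs[i])
        x = freqs[s] * (x >> SCALE_BITS) + slot - cum[s]
        while x < RANS_L:            # renormalize: pull 16 bits in
            x = (x << 16) | words[pos]
            pos += 1
        syms.append(s)
    return syms
```

Flushing per chunk just means running this flush/reload pair at every 64k boundary; the cost is the few extra state bytes per chunk mentioned above.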

Entropy coding experiments

While working on BitKnit, I experimented a lot with different entropy coding options, and much of it ended up in our other formats in some way or other. My blog post “Mixing discrete probability distributions” was written while working on BK and underlies both its semi-adaptive models and the follow-up “Models for adaptive arithmetic coding”. The latter are what’s used in Oodle LZNA, mostly with 8-, 9-, 16- or 17-symbol alphabets, using a SIMD decoder. This style of model, with a larger radix than binary, needs far fewer symbol decode operations per output byte than the equivalent binary arithmetic coder, and especially combined with interleaved rANS coding gives quite a nice speed boost in the decoder. The caveat is that they’re usually not quite as good in terms of compression ratio as the “binary tree”-style adaptive binary models, but I consider it a great trade-off that even now, 8 years after writing it up, feels under-explored to me. (LZNA did use it; our follow-up formats didn’t, simply because we’ve generally moved away from design points where adaptive coding makes sense at all, since in our application space, decoder speed is very important.)

The actual models that ended up in LZNA were designed by my colleague Charles Bloom, but it all started with BitKnit prototypes; ultimately, the adaptive coding direction was rejected as too slow for the rates BK was targeting (not least because some of the platforms BK was meant to work on didn’t have suitable integer SIMD either), but it was a great fit for LZNA’s slower target. Alas, it too didn’t last long; Leviathan, which came out a few years later, usually gets quite close – sometimes better – at something like 10x the decode speed, using purely static models.

LZ literals

The main problem with batched model adaptation is that it takes quite a bit longer for the encoder (or decoder) to respond to changes in symbol statistics. LZMA’s backend can pivot rapidly on such changes: when an unexpected (i.e. low-probability) symbol is seen, its probability is immediately boosted drastically. In my experiments, even batched updates that run every 2 symbols (i.e. update after every second symbol) did much worse than immediate updates/full adaptation on LZ literals, i.e. all the bytes not covered by string matches. Other types of symbols fared much better in my testing, but that’s of little use, because literals usually make up most of a compressed LZ data stream, both by symbol count and by encoded size. If the literals have to be fully adaptive, that’s most of the runtime cost of full adaptation already, which as already mentioned was slower than what BK was aiming for.

Because BK was meant for Granny (a 3D mesh and animation exporter/runtime toolkit), most of the data I was looking at, for mesh exports in particular, was mesh vertex data, topology (in the form of index buffers), and textures. Vertex data is laid out as fixed-size structs containing fields like vertex positions, normals, UV coordinates, and so forth. Meshes often have a fair bit of redundancy between vertices: the bytes encoding the position of one vertex frequently match those of other nearby vertices, authored vertex data commonly contains axis-aligned planar components with a lot more structure still, and both the normals and UV coordinates in such cases tend to be highly repetitive as well, although usually with some variation. Indices are lists of small integers that are usually “more or less” increasing (meaning the average value of the indices seen in a sliding window of recent indices steadily goes up over time, although there are of course outliers). Some textures are quite noisy, and LZs don’t do great on those (you can improve things slightly with methods like PNG’s filters). Some are “vector-y” with very geometric features and large flat areas, and LZs do much better on those. And some are “vector-y” with gradients, which again LZs don’t do awesomely on (it somewhat depends on gradient orientation), but those obviously have very nice structure in the byte values.
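As a small illustration of how much structure such fixed-size-record data has, consider byte-wise delta filtering at the record stride, the kind of transform that turns nearly-repeating records into runs of near-zero bytes. This is illustrative code, not anything from BitKnit or Granny:

```python
# Delta-filter an array of fixed-size records at the record stride:
# each byte is replaced by its difference (mod 256) from the byte one
# record earlier. Near-identical records become near-zero output.
def delta_filter(data: bytes, stride: int) -> bytes:
    out = bytearray(data[:stride])   # first record passes through as-is
    for i in range(stride, len(data)):
        out.append((data[i] - data[i - stride]) & 0xFF)
    return bytes(out)
```

On an array of identical 4-byte records, everything past the first record filters to zero, which any LZ or entropy coder then handles trivially.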

What most of these have in common is that delta coding (i.e. sending values as differences from a prior value, rather than directly) tends to work well on them. And from a different direction, LZMA (the main codec I was comparing against for BK) has a special case in its literal coding for the literal immediately following a match: the idea is that the decoder can be fairly sure that right after a match ends, the next literal in the data stream is not what the next byte after the final matched location would have been – otherwise we would just have sent a longer match to begin with. LZMA does not strictly enforce this, and an optimal parser will sometimes make such choices when they turn out to be advantageous, but the probability of the literal byte being the same as the match byte is still quite low. So after a match, LZMA uses a probability model for the XOR of the literal byte and the match byte.

This does two things: first, we expect that XOR to not be 0 with high likelihood, so even if everything else is completely random, we now end up with an alphabet of 255 uniformly random symbols (plus a very low-likelihood 0), instead of 256 random symbols, which is slightly better even assuming nothing about the distribution of values at all. But second, the match byte and the literal byte are not uncorrelated in practice. The first literal after a match, especially in scenarios like ours where we have arrays of fixed-size records with very regular structure, is often a “near miss”, and the match byte still ends up being a good predictor. LZMA uses a bit-wise binary tree model, so XOR is a natural predictor for that topology.

BitKnit, however, does not; it uses byte-based coding, and as explained above, we have a lot of data where we expect delta coding to do well. Hence, BitKnit sends the byte difference (mod 256) between the match byte and the literal byte. This worked great, and some experimentation later, it just ended up sending all literals this way, not just the first literal after a match: the first literal after a match sends the difference to the byte right after the match ended, the second literal the difference to the byte after that, and so on.
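In code, the idea looks something like the following sketch; the framing and function names are illustrative (BitKnit’s actual bitstream obviously entropy-codes the deltas rather than passing them around as lists), but the arithmetic is the technique described above:

```python
# BitKnit-style delta literals: each literal is transmitted as
# (literal - match_byte) mod 256, where the match byte is the byte
# at the current position minus the most recent match offset.
def encode_deltas(literals, offset, history):
    out = list(history)              # already-emitted bytes
    deltas = []
    for lit in literals:
        match_byte = out[len(out) - offset]
        deltas.append((lit - match_byte) & 0xFF)
        out.append(lit)
    return deltas

def decode_deltas(deltas, offset, history):
    out = list(history)              # already-decoded bytes
    for d in deltas:
        match_byte = out[len(out) - offset]
        out.append((match_byte + d) & 0xFF)
    return bytes(out[len(history):])
```

On “near miss” literals following a match into a similar record, the deltas cluster around 0, which is exactly the kind of skewed distribution an entropy coder likes.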

This solved several problems at once: first, it acts as an automatic delta coder for this kind of data, and it’s neatly integrated with the LZ, so there’s no fiddly setting up of good byte distances to use for delta-ing or similar. Because these distances are determined by the match distances, it’s all automatic within the format. It’s not generally as good as a custom format-aware delta coder, but it tends to get a lot of the benefit and Just Works. Second, it turns out that (for the data I was looking at, anyway) the delta-literal distribution was usually a lot less variable than the raw literal distribution, which was the key enabler for BitKnit’s use of semi-adaptive models throughout.

BitKnit itself always codes literals this way; this is not ideal for other types of data, but – somewhat surprisingly, to me anyway – it worked decently even on more text-based data like JSON or XML. Right around the time of BK’s original release, Charles wrote this up in his post on what he ended up calling LZ-Sub (within BitKnit, the technique never needed a name; it was just how the literals worked). Our later codecs Kraken, Mermaid and Leviathan all used this method and later expanded on it. Notably, none of the newer codecs go strictly one way or the other. Kraken and Mermaid work in chunks (128k of uncompressed data), and both of them have a per-chunk flag for whether literals should be taken as raw values or as deltas from the match bytes. The delta decoding is slightly more expensive than just copying raw literals, but it does tend to give decent compression ratio improvements on many sources. As with essentially all other decisions in these codecs, this one is made from an explicit space-speed trade-off (i.e. evaluate the compression ratio gains, if any, of using delta literals, estimate the decompression cost, and decide by weighing both factors according to a user-configurable trade-off). Leviathan goes further down this path and adds several other literal modes, so there’s not just two. Leviathan doesn’t have any notion of per-byte difference other than subtraction – we tried XOR but never found a compelling use – and instead can optionally include other types of context in its literal modeling.

The trade-off here is that the more complicated modes are both more expensive to decode and need to be evaluated by the encoder, which adds a bit to the encode-time cost.

For an example of how this works out in practice, you can look at Aras Pranckevičius’ recent (at the time of writing) series of blog posts. In post 2 of the series, he compares several of the Oodle codecs to a bunch of other competitors and finds that we compare very well in terms of compression ratio, compression speed and decompression speed. The differences in his first test are as drastic as they are largely because Aras’ test set (structured binary data, namely several large-ish arrays of data items that are 4 floats each) plays exactly to our formats’ strengths. As you might notice, this is very similar to the “arrays of vertex data” situation that was the design goal for BitKnit and prompted many of its design decisions. Kraken and Mermaid improved on BK’s design considerably, but the delta literals in particular come straight from BK and work very well for this kind of data. Later in the series, in part 6, Aras experiments with his own data-specific delta filtering, at which point the big differences between Zstd and Kraken essentially disappear (worth noting that his results are squarely in the fast-to-medium part of the compression speed scale; Zstd gets better bang for the buck in that range, while we tend to gain some more compression ratio at the slower settings, but overall I’d describe the two as comparable). In terms of absolute numbers, the delta literals in Kraken mean that Kraken at “Optimal1” (level 5) in the first test hits about a 3.71:1 compression ratio, while Zstd level 18 (comparable encode speed) manages around 3.04:1. The latter test has Kraken Optimal1 around 4.34:1 and Zstd level 15 (the closest quoted) around 4.17:1, which I’m going to call roughly even. (We could make the two codecs trade places here by playing around more with Zstd’s compression levels and Kraken’s configurable space/speed trade-off setting.)
In short, the delta literals manage to extract about half the gain of a specialized difference filter, with no manual data-specific work required.

Lots of repeat offsets

The other main fairly unique design choice in BitKnit is having many repeat offset slots: where LZX, LZMA and Zstd (and also Kraken, for that matter) use 3, BitKnit uses an extravagant 7. As with everything else in BitKnit, this comes down to its genesis as a codec for a data format leaning towards fixed-size records. In short, the extra repeat offset slots are useful for the data it’s designed for, whereas for many other types of data they don’t really help. The extra repeat offsets rarely outright hurt compression, but maintaining them does take time in the decoder.

One weird wrinkle in BitKnit that we didn’t take forward into other formats is the replacement policy. The standard method is “move to front” coding, i.e. repeat offset slots are numbered from most to least recently used. With BitKnit’s many offset slots, I found that I got some slight wins by not inserting new offsets into the precious “0” (frontmost) slot; instead, new offsets got inserted into slot 2, and the only way for an offset to make it into slots 0 or 1 is to get reused at least once. This was a slight win in BitKnit but didn’t make it into any of our newer codecs; notably, this was designed into BitKnit before I had an optimal (or optimizing, anyway) parser working, and it was a small but solid win with the heuristics-driven parse I was using in the encoder at the time. When we re-evaluated this idea for other formats later, we found that optimizing parsers are good at managing the recent offset slots anyway, so there wasn’t much benefit to it. Leviathan did end up using 7 recent match slots (largely because it provided benefits on certain types of highly structured data, same as I saw in BK), but with standard MTF/LRU update rules. (I wrote up how BitKnit implements its offset history years ago.)
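The update rule itself is simple; here’s a sketch of the policy as described above, with 7 slots, standard MTF on reuse, and new offsets entering at slot 2. The class shape and initial slot contents are illustrative, not BitKnit’s actual implementation:

```python
# BitKnit-style recent-offset history: reused offsets move to the front
# (standard MTF), but *new* offsets are inserted at slot 2, so an offset
# must be reused at least once before it can occupy slot 0 or 1.
class OffsetHistory:
    NUM_SLOTS = 7
    INSERT_POS = 2

    def __init__(self):
        self.slots = [1, 2, 3, 4, 5, 6, 7]  # arbitrary initial offsets

    def use_repeat(self, slot):
        # A repeat-offset match: move the used offset to the front.
        off = self.slots.pop(slot)
        self.slots.insert(0, off)
        return off

    def add_new(self, off):
        # A match with an explicitly coded offset: evict the LRU slot
        # and insert at position 2, protecting the two hottest slots.
        self.slots.pop()
        self.slots.insert(self.INSERT_POS, off)
```

With standard MTF, `add_new` would insert at position 0 instead; that one-line difference is the whole wrinkle.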

And that’s it! I’m not sure if these notes are valuable to anyone else, but BK was a fun little codec to work on (and it was little: decoder, heuristic encoder with multiple speed levels, and optimizing encoder all fit into a 2700-line C++ file, with extensive comments), and it led to many ideas that we elaborated on further in our follow-up codecs.
