28-Nov-2018: I’ve added a small update at the end on how to prevent
‘handle collisions’ with a per-slot generation counter

Original Post:

…wherein I talk a bit about how I’m doing dynamic memory management in C
and C++ these days which basically replaces raw- and smart-pointers with
‘index-handles’.

In my last blog post I was mentioning
pointer- and allocation-free programming, but was skipping over
the details. This is what the following blog post is
about.

This is all based on the (sometimes painful) experience of wrestling for 15+
years with fairly big C++ code bases (0.5 to around 1 mloc) where memory is
often managed through smart pointers. The worst case being tens- to
hundreds-of-thousands of small C++ objects, each in its own heap
allocation, pointing to each other through smart pointers. While such code
is quite robust in terms of memory corruption (segfaults and corruption
rarely happens, since most attempts are caught by asserts when dereferencing
smart pointers), this type of ‘object spiderweb code’ is also dog-slow
without obvious starting points for optimization, since the entire code is full of
cache misses. Other typical problems are memory fragmentation and ‘fake
memory leaks’ because a forgotten smart pointer prevents freeing the
underlying memory (I call them ‘fake leaks’ because this type of leaks cannot
be caught by memory debugging tools).

Nothing presented here is particularly new or clever, it’s just
a collection of a few simple ideas which together work fairly well in bigger
code bases to prevent (or at least detect early) a number of common
memory-related problems in C and C++, and may even be useful in higher-level
garbage-collected languages to reduce pressure on the garbage collector.

However, the underlying design philosophy doesn’t fit very well into a
classical OOP world where applications are built from small autonomous
objects interacting with each other. That’s why it’s also quite tricky to
implement those ideas in a big existing OOP code base, where object creation
and destruction happens ‘decentralized’ all over the code.

The approach described here works very well though with a data-oriented
architecture, where central systems work on arrays of data items packed
tightly in memory.

Most of the following blog post is written from a game developer’s
perspective, but should also apply to other areas where a program needs to
juggle a few hundred to a few million objects (or generally ‘data items’) in
memory, and where such items are created and destroyed frequently.

The gist is:

  • move all memory management into centralized systems (like rendering, physics, animation, …), with the systems being the sole owner of their memory allocations
  • group items of the same type into arrays, and treat the array base pointer as system-private
  • when creating an item, only return an ‘index-handle’ to the outside world, not a pointer to the item
  • in the index-handles, only use as many bits as needed for the array index, and use the remaining bits for additional memory safety checks
  • only convert a handle to a pointer when absolutely needed, and don’t store the pointer anywhere

I’ll explain each of those points in detail below. But the idea is basically
that common ‘user-level’ code doesn’t directly call memory allocation
functions (like malloc, new or make_shared/make_unique), and reduces the use
of pointers to an absolute minimum (only as short-lived references when
direct memory access to an item is absolutely needed). Most importantly,
pointers are never the ‘owner’ of an item’s underlying memory.

Instead, direct memory manipulation happens as much as possible inside a few
centralized systems where memory-related problems are easier to debug and
optimize.

Move memory management into central systems

In this blog post, a ‘system’ is a (usually fairly big) part of a code base
which takes care of a number of related tasks, like ‘rendering’, ‘physics’,
‘AI’, ‘character animation’ and so on. Such a system is separated from other
systems and ‘user-code’ through a clearly defined function API, and the work
that happens on the system’s data is performed in tight central loops instead
of being spread out all over the code base.

Systems often work on items created and destroyed under control of user code
(but note that creation and destruction of items is different from allocating
and freeing the memory used by those items!). For instance a rendering system
might deal with vertex buffers, textures, shaders and pipeline state objects.
A physics system works with rigid bodies, joints and collision primitives,
and an animation system works with animation keys and curves.

It makes sense to move the memory management for such items into the systems
themselves, because a general memory allocator doesn’t have the
system-specific ‘domain knowledge’ about how data items are processed and the
relationships between data items. This allows the system to optimize memory
allocations, perform additional validation checks when creating and
destroying items, and arrange items in memory for making best use of the
CPU’s data caches.

A good example for this ‘system domain knowledge’ is the destruction of
rendering resource objects with modern 3D APIs: a resource object can not
simply be destroyed immediately when the user code says so because the
resource might still be referenced in a command list waiting to be consumed
by the GPU. Instead the rendering system would only mark the resource object
for destruction when the user code requests its destruction, but the actual
destruction happens at a later time when the GPU no longer uses the resource.

Group items of the same type into arrays

Once all memory management has moved into systems, the system can optimize
memory allocations and memory layout with its additional knowledge about how
items are used. One obvious optimization is to reduce the number of general
memory allocations by grouping items of the same type into arrays, and
allocate those arrays at system startup.

Instead of performing a memory allocation each time a new item is created,
the system keeps track of free array slots, and picks the next free slot.
When the user code no longer needs the item, the system simply marks the slot
as free again instead of performing a deallocation (not different
from a typical pool allocator).

This pool allocation is most likely a little bit faster than performing a
memory allocation per item, but this is not even the main reason for keeping
items in arrays (modern general allocators are quite fast too for small
allocations).

A few additional advantages are:

  • items are guaranteed to be packed tightly in memory, general allocators sometimes need to keep some housekeeping data next to the actual item memory
  • it’s easier to keep ‘hot items’ in continuous memory ranges, so that the CPU can make better use of its data caches
  • it’s also possible to split the data of a single item into several subitems in different arrays for even tighter packing and better data cache usage (AoS vs SoA and everything inbetween), and all those data layout details remain private to the system and are trivial to change without affecting ‘outside code’
  • as long as the system doesn’t need to reallocate arrays, it is guaranteed that there will be no memory fragmentation (although this is less of an issue in a 64-bit address space)
  • it’s easier to detect memory leaks early, and provide more useful error messages: when a new item is created a system can trivially check the current number of items against an expected upper bound (for instance a game might know that there should never be more than 1024 textures alive at a time, and since all textures are created through the rendering system, the system can print out a more useful warning message when this number is exceeded)

Public index-handles instead of pointers

Keeping system items in arrays instead of unique allocations has the
advantage that an item can be identified through an array index instead of
requiring a full pointer. This is very useful for memory safety.
Instead of handing memory pointers to the outside world, the system can treat
the array base pointers as ‘private knowledge’, and only hand out array
indices to the public. Without the base pointer to compute an item’s memory
location, the outside code can’t access the item’s memory, even with a lot
of criminal energy.

This has a number of further advantages:

  • In many situations, code outside the system never even needs to directly access memory of an item, only the system does. In such an ‘ideal’ situation, user code never accesses memory through pointers, and can never cause memory corruption.
  • Since only the system knows the array base pointers, it’s free to move or reallocate the item arrays at will without invalidating existing index handles.
  • Array indices need fewer bits than full pointers, and a smaller data type can be picked for them, which in turn allows tighter packing of data structures and better data cache usage (this has the caveat that additional handle bits can be used to increase memory safety, more about this below)

If user code needs to access the memory of an item directly it needs to obtain
a pointer through a ‘lookup function’ which takes a handle as input and returns
a pointer. As soon as such a lookup function exists, the fairly watertight
memory safety scenario outlined above is no longer guaranteed, and the user code
should adhere to a few rules:

  • pointers should never be stored anywhere, since the next time the pointer is used it may no longer point to the same item, or even to valid memory
  • a pointer should only be used in a simple code block, and not ‘across’ function calls

Every time a handle is converted into a pointer, the system can guarantee
that the returned pointer still points to the same item that the handle was
originally created for (more on this below), but this guarantee ‘decays’ over
time since the item under the pointer may have been destroyed, or the underlying
memory may have been reallocated to a different location (this is the same
problem as iterator invalidation in C++).

The two simple rules above are easy to memorize and are a good compromise between
not exposing pointers to user code at all, and having a (somewhat costly)
handle-to-pointer conversion for every single memory access.

Memory safety considerations

First, each type of handle should get its own C/C++ type, so that attempting
to pass the wrong handle type to a function is already detected at compile
time (note that a simple typedef isn’t enough to produce a compiler warning,
the handle must be wrapped into its own struct or class – however this could
be limited to debug compilation mode).

All runtime memory safety checks happen in the function which converts a
handle into a pointer. If the handle is just an array index, it would look
like this:

  • a range check happens for the index against the current item array size, this prevents segmentation faults and reading or writing allocated but unrelated memory areas
  • a check needs to happen whether the indexed array item slot contains an active item (is not currently a ‘free slot’), this prevents the simple variant of ‘use after free’
  • finally the item pointer is computed from the private array base pointer and public item index

The resulting pointer is safe to use as long as:

  • the item array isn’t reallocated
  • the indexed item hasn’t been destroyed

Both can only happen inside one of the system’s functions, that’s why the two
‘pointer usage rules’ exist (don’t store pointers, don’t keep pointers across
function calls).

There is a pretty big hole in the above use-after-free check though: if we only
check if an array slot behind an index-handle contains a valid item, it’s
not guaranteed that it is the same item the handle was originally created
for. It can happen that the original item was destroyed, and the same
array slot was reused for a new item.

This is where the ‘free bits’ in a handle come in: Let’s say our
handles are 16-bits, but we only ever need 1024 items alive
at the same time. Only 10 index bits are needed to address 1024 items, which
leaves 6 bits free for something else.

If those 6 bits contain some sort of ‘unique pattern’, it’s possible to
detect dangling accesses:

  • When an item is created, a free array item is picked and its index is put into the lower 10 handle bits. The upper 6 bits are set to a ‘unique bit pattern’
  • The resulting 16-bit handle (10 bits index + 6 bit ‘unique pattern’) are returned to the outside world, and at the same time stored with the array slot.
  • When an item is destroyed, the item handle stored with the array slot is set to the ‘invalid handle’ value (can be zero, as long as zero is never returned as a valid handle to the outside world)
  • When the handle is converted to a pointer, the lower 10 bits are used as array index to lookup the array slot, and the entire 16-bit handle is compared against the handle that’s currently stored with the array slot:
    • if both handles are equal, the pointer is valid and points to the same item the handle was created for
    • otherwise this is a dangling access, the slot item has either been destroyed (in that case the stored handle would have the ‘invalid handle’ value), or has been destroyed and reused for a new item (in that case the upper 6 ‘unique pattern’ bits don’t match)

This handle-comparison check when converting a handle to a pointer works
quite well to detect dangling-accesses, but it isn’t waterproof because the
same combination of array index and ‘unique pattern’ will be created sooner or
later.
But it’s still better than no dangling protection at all (like raw pointers),
or a ‘fake memory leak’ which would happen in similar situations with smart pointers.

Finding good strategies to create unique handles that collide as rarely as
possible is the most important part of course, and left as an excercise to
the reader ;P

Obviously it’s good to use as many bits for the unique pattern as possible, and the
way how free array slots are reused is important as well (e.g. LIFO vs FIFO). It’s
probably also good to write a little creation/destruction stress-test which
checks for handle collisions and can be used to tweak the unique-pattern creation for
specific use cases. A system where items are created and destroyed with a very
high frequency needs more effort (or simply more handle bits) than systems
where item creation and destruction is a rare occurance.

Other useful properties of handles

Apart from the whole memory safety aspect, handles are also useful for other
situations where pointers are problematic:

Handles can be used as shared object identifiers across processes (all you need is
some sort of ‘create_item_with_handle()’ function, which doesn’t create a
new handle, but takes an existing handle as input argument). This is especially
useful for online games where handles can be shared between the server and
all clients in a game session, or in savegame systems to store references
to other objects.

Sometimes it’s useful to create a whole group of related items (for instance
animation keys and curves), and reference the whole item group with a single
handle. In that case some sort of ‘range handle’ can be used, which contains
not only an index (of the first item), but also the number of items in the range.

In some situations it’s also useful to reserve a few handle bits for an item
type if static type checking at compile time isn’t sufficient.

In conclusion, I find it quite surprising how naturally and elegantly handles
solve many problems I encountered in the past with the traditional ‘pointers
to objects on the heap’ model, and how little I miss this model now (and the
parts of C++ that are built around it).

Some real-world examples

The sokol-gfx API is an example of a C-API which uses handles instead of
pointer of rendering resource objects (buffers, images, shaders, …):

sokol_gfx.h

The Oryol Gfx module is a similar 3D API wrapper, but written in C++:

Oryol Gfx Module

The Oryol Animation extension module is a character animation system which keeps all its data in arrays:

Oryol Anim Module

Update 28-Nov-2018

…in the post above I was sort of handwaving away the problem of creating
the same unique-tag twice for the same slot, and a nice person on twitter
hinted me about a very simple, elegant and embarrassingly ‘obvious’ solution:

Each array slot gets its own generation counter, which is bumped when a
handle is released (can also happen when the handle is created, but bumping
on release means you don’t need a reserved value for “free slots” to detect
an invalid handle).

To check if a handle is valid, simply compare its unique-tag with the current
generation counter in its slot.

Once the generation counter would ‘overflow’, disable that array slot, so
that no new handles are returned for this slot.

This is a perfect solution for avoiding handle collisions, but the handles
will eventually run out since all array slots will be disabled eventually. But
since each slot has its own counter, this only happens after all
handle-bits are exhausted, not just the few unique-tag bits.

So with 32-bit handles, you can always create 4 billion items, with at most
2^(32 – num_counter_bits) alive at the same time. This also means the number of
bits for the unique-tag can be reduced without compromising ‘handle safety’.

It may also be possible to re-activate disabled slots once it can be
guaranteed that no more handles for that slot are out in the wild
(maybe at special places in the code like entering or exiting a level).

Read More