June 12th, 2023 @ justine’s web page

[DeepMind Logo]

A few days ago, DeepMind published a
blog
post
talking about a
paper
they wrote, where they discovered tinier kernels for sorting algorithms.
They did this by taking their deep learning wisdom, which they gained by
building AlphaGo, and applying it to the discipline of
of superoptimization.
That piqued my interest, since as a C library author, I’m always looking
for opportunities to curate the best stuff. In some ways that’s really
the whole purpose of the C library. There are so many functions that we
as programmers take for granted, which are the finished product of
decades of research, distilled into plain and portable code.

DeepMind earned a fair amount of well-deserved attention for this
discovery, but unfortunately they could have done a much better job
explaining it. Let’s start with the assembly code they published for
sorting an array with three items, translated from pseudo-assembly into
assembly:

/	move37.S
	.equ	P,%rax
	.equ	Q,%rcx
	.equ	R,%rdx
	.equ	S,%rsi
move37:	mov	(%rdi),P
	mov	8(%rdi),Q
	mov	16(%rdi),R
	mov	R,S
	cmp	P,R
	cmovg	P,R
	cmovl	P,S
	cmp	S,Q
	cmovg	Q,P
	cmovg	S,Q
	mov	R,(%rdi)
	mov	Q,8(%rdi)
	mov	P,16(%rdi)
	ret
	.type	move37,@function
	.size	move37,.-move37
	.globl	move37
// deepsort1.c
#include 

void move37(long *);

int main() {
  long A[3] = {3, 1, 2};
  move37(A);
  printf("%d %d %dn", A[0], A[1], A[2]);
}

I named this function move37() because the DeepMind blog
post compares it to the 37th move AlphaGo made during its second
match
with Lee Sedol
back in 2016. That’s the move which stunned the
experts who thought AlphaGo had made a mistake, but they were wrong,
because the machine ultimately did achieve
victory
against its opponent, who saw himself as the Hector of humanity. So if
we run the DeepMind code:

# run this on the shell
cc -o deepsort1 deepsort1.c move37.S
./deepsort1
2 1 3

That looks like a mistake to me. The array we gave it was {3, 1, 2} but
move37() sorted it into {2, 1, 3}. DeepMind must be
trolling us, because I don’t believe that 2 comes before 1. Let’s take a
look at the
open source contribution they
made to LLVM libcxx
which should hopefully clarify things:

// Ensures that *__x, *__y and *__z are ordered according to the comparator __c,
// under the assumption that *__y and *__z are already ordered.
template <class _Compare, class _RandomAccessIterator>
inline _LIBCPP_HIDE_FROM_ABI void __partially_sorted_swap(
    _RandomAccessIterator __x, _RandomAccessIterator __y,
    _RandomAccessIterator __z, _Compare __c) {
  using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
  bool __r = __c(*__z, *__x);
  value_type __tmp = __r ? *__z : *__x;
  *__z = __r ? *__x : *__z;
  __r = __c(__tmp, *__y);
  *__x = __r ? *__x : *__y;
  *__y = __r ? *__y : __tmp;
}

Now it makes sense. So move37() isn’t actually a sorting
function. It’s a sorting kernel that’s intended to be used as
the building block of the sort3() function. It would have
been nice if the paper and blog post had mentioned that, since it made
me feel quite confused for the briefest of moments. Here’s a better
version of the code, which includes the missing swap operation.

sort3:	mov	(%rdi),%rcx
	mov	8(%rdi),%rdx
	mov	16(%rdi),%rsi
	mov	%rdx,%rax
	cmp	%rdx,%rsi
	cmovl	%rsi,%rax
	cmovl	%rdx,%rsi
	mov	%rcx,%rdx
	cmp	%rcx,%rsi
	cmovl	%rsi,%rdx
	cmovl	%rcx,%rsi
	cmp	%rax,%rdx
	cmovge	%rax,%rcx
	cmovl	%rax,%rdx
	mov	%rcx,(%rdi)
	mov	%rdx,8(%rdi)
	mov	%rsi,16(%rdi)
	ret
	.globl	sort3
	.size	sort3,.-sort3

To explain why their code is important, let’s consider how this
algorithm works at a high level. When I first tried to solve
the sort3() problem on my own, I came up with this:

	// sorts [a,b,c]
	if (a > b) SWAP(a, b);
	if (a > c) SWAP(a, c);
	if (b > c) SWAP(b, c);

Then I looked at libcxx and found out they were doing the same thing.
The issue with the code above is that compilers aren’t very good at
optimizing it. If you try to compile the above code, you’ll notice that
your compiler inserts a lot of branch instructions. This is what
DeepMind sought to improve upon with their LLVM contribution, because
cleverer ways exist to code this sort of thing. However those techniques
tend to be less understandable. I actually like my naive code, since if
we squint our eyes a bit, we can see a pattern exists where it shares
the same basic idea as DeepMind’s state of the art assembly code. That
idea is this problem essentially boils down into being just three
compare and swap operations:

	mov	%rdx,%rax		// create temporary
	cmp	%rdx,%rsi		// compare
	cmovl	%rsi,%rax		// conditional move
	cmovl	%rdx,%rsi		// conditional move
/	repeat thrice

The above code was the state of the art beforehand for sorting networks.
Now here’s where DeepMind’s novel discovery came into play. They found
out that sometimes the mov instruction above is
unnecessary.

sort3:	mov	(%rdi),%rcx
	mov	8(%rdi),%rdx
	mov	16(%rdi),%rsi
	mov	%rdx,%rax
	cmp	%rdx,%rsi
	cmovl	%rsi,%rax
	cmovl	%rdx,%rsi
	mov	%rcx,%rdx
	cmp	%rcx,%rsi
	cmovl	%rsi,%rdx
	cmovl	%rcx,%rsi
	mov	%rdx,%rcx		// <-- wrekt by AlphaDev
	cmp	%rax,%rdx
	cmovge	%rax,%rcx
	cmovl	%rax,%rdx
	mov	%rcx,(%rdi)
	mov	%rdx,8(%rdi)
	mov	%rsi,16(%rdi)
	ret

If you try running the above code, then you’ll see it’s 100% correct
with or without the striked-out line. The line of code looks like it’s
doing something, but it’s actually doing nothing. So it doesn’t surprise
me that something like this could have gone unnoticed by computer
science for decades. It should also become clearer now how AlphaDev
works. DeepMind basically built an artificial intelligence that fiddles
around with assembly code and deletes stuff at random to see if it
breaks. I’m not saying this to dismiss AlphaDev’s intelligence, since
I’d be lying if I said I wasn’t doing the same thing.

sort3:	mov	(%rdi),%rcx
	mov	8(%rdi),%rdx
	mov	16(%rdi),%rsi
	mov	%rdx,%rax		// can it go?
	cmp	%rdx,%rsi
	cmovl	%rsi,%rax
	cmovl	%rdx,%rsi
	mov	%rcx,%rdx		// can it go?
	cmp	%rcx,%rsi
	cmovl	%rsi,%rdx
	cmovl	%rcx,%rsi
	mov	%rdx,%rcx		// <-- wrekt by AlphaDev
	cmp	%rax,%rdx
	cmovge	%rax,%rcx
	cmovl	%rax,%rdx
	mov	%rcx,(%rdi)
	mov	%rdx,8(%rdi)
	mov	%rsi,16(%rdi)
	ret

DeepMind also left some meat on the table. There’s still two more
mov instructions in the above code we could potentially
shave away. One of the ways we might do that is by using the ARM64
instruction set, which yields tinier code for problems like these. Here
we see that we don’t need any instructions for creating temporary
variables:

sort4:	ldp	x1,x2,[x0]
	ldr	x3,[x0,16]
	cmp	x2,x3
	csel	x4,x2,x3,le
	csel	x2,x2,x3,ge
	cmp	x2,x1
	csel	x3,x2,x1,le
	csel	x2,x2,x1,ge
	cmp	x4,x3
	csel	x5,x1,x4,gt
	csel	x4,x4,x3,ge
	stp	x5,x4,[x0]
	str	x2,[x0,16]
	ret

Arm is all the rage these days, and I imagine the example above serves
as evidence of why they’ve earned their fame. Arm Limited is also one of
the most benevolent companies in open source right now. For example,
their MbedTLS library
is one of the most underrated gems I’ve seen so far. When I started
using it, I originally had this scheme of modifying Arm’s code to work
better on x86 hardware instead. I wrote all these tricked out assembly
optimizations bringing it to the same realm of performance as OpenSSL on
x86. MbedTLS is plain, portable, hackable C code, so this is good news
for anyone wanting a crypto library that isn’t assembly generated by
Perl. I told the folks at Arm what I was doing, and instead of finding
it subversive they were very kind and encouraging, since that’s the kind
of people they are. One day I hope to find time to do what DeepMind did,
and upstream my modifications. Arm is also prolific for
their Optimized
Routines
library, which is up there with
double-conversion
in terms of impeccable quality. It’s of particular interest to C
libraries, since for decades the open source community has subsisted off
math functions written by Sun Microsystems back in the early 90’s. Arm
found a way to improve upon several of them, such
as pow(x,y). That’s a very high impact thing to do,
considering it’s a one of the most fundamental operations in math. For
example, if you use Arm’s solution in pure software to
implement pow(x,y) on an x86 machine, then it’ll go 5x
faster than Intel’s native x87 instructions for doing the same thing.

Since we’re so fortunate to have DeepMind entering this game too, I’ve
taken the liberty of translating their libcxx diff into plain readable C
code, so that everyone can appreciate its beauty. It’s another thing I
would have liked to see included in the paper and blog post, because in
this code you’ll find the canonical trick that experts use for getting
compilers to generate branchless MOVcc instructions.

// sorts [a,b]
static inline void Sort2(long *a, long *b) {
  int r = *a < *b;
  long t = r ? *a : *b;
  *b = r ? *b : *a;
  *a = t;
}

// sorts [a,b,c] assuming [b,c] is already sorted
static inline void PartialSort3(long *a, long *b, long *c) {
  int r = *c < *a;
  long t = r ? *c : *a;
  *c = r ? *a : *c;
  r = t < *b;
  *a = r ? *a : *b;
  *b = r ? *b : t;
}

// sorts [a,b,c]
static inline void Sort3(long *a, long *b, long *c) {
  Sort2(b, c);
  PartialSort3(a, b, c);
}
// sorts [a,b,c,d]
static inline void Sort4(long *a, long *b, long *c, long *d) {
  Sort2(a, c);
  Sort2(b, d);
  Sort2(a, b);
  Sort2(c, d);
  Sort2(b, c);
}
// sorts [a,b,c,d,e]
static inline void Sort5(long *a, long *b, long *c, long *d, long *e) {
  Sort2(a, b);
  Sort2(d, e);
  PartialSort3(c, d, e);
  Sort2(b, e);
  PartialSort3(a, c, d);
  PartialSort3(b, c, d);
}

Once I saw the Sort5() function, I felt like I had gained a
better understanding of what had motivated DeepMind’s research. If you
compile the Sort5() function on ARM64, then your compiler
will produce a function juggling 11 registers. If you were reasoning
about a mathematical equation, then would you be able to hold eleven
variables in your working memory at once? Probably not, which is why
having a nice kernel function like PartialSort3 is so
useful. As sentient creatures, human beings aren’t that much different
from the monkeys we once were. The main thing that makes us intelligent
is our ability to take hard problems and break them down into smaller
ones. So it’s nice to see deep learning being applied to turbocharging
our abstractions.

It’s also worth mentioning that Sort3()
and Sort5() are kernels themselves, since they’re intended
to be building blocks for a conventional sorting function. The blog post
covers this topic, but I thought it’d be useful to share something
that’s actually portable and executable.

static inline void InsertionSort(long *A, long n) {
  long i, j, t;
  for (i = 1; i < n; i++) {
    t = A[i];
    j = i - 1;
    while (j >= 0 && A[j] > t) {
      A[j + 1] = A[j];
      j = j - 1;
    }
    A[j + 1] = t;
  }
}

void longsort(long *A, long n) {
  long t, p, i, j;
  switch (n) {
    case 0:
      return;
    case 1:
      return;
    case 2:
      return Sort2(A, A + 1);
    case 3:
      return Sort3(A, A + 1, A + 2);
    case 4:
      return Sort4(A, A + 1, A + 2, A + 3);
    case 5:
      return Sort5(A, A + 1, A + 2, A + 3, A + 4);
    default:
      if (n <= 32) {
        InsertionSort(A, n);
      } else {
        for (p = A[n >> 1], i = 0, j = n - 1;; i++, j--) {
          while (A[i] < p) i++;
          while (A[j] > p) j--;
          if (i >= j) break;
          t = A[i];
          A[i] = A[j];
          A[j] = t;
        }
        LongSort(A, i);
        LongSort(A + i, n - i);
      }
      break;
  }
}

The above algorithm shows what the new and improved libcxx is doing.
It’s basically quicksort except it switches to the sorting kernels and
insertion sort when recursing into smaller slices. With libcxx I think
they even took the added step of schlepping in heapsort, which is kind
of slow, but prevents adversaries from smashing your stack.

The main thing you may be wondering at this point is, can I use this? Do
these sorting network kernels actually make sorting go faster? I would
say yes and no. When all you want is to sort ascending longs, the code
above will go 2x faster than the standard qsort() function
provided by your C library. Except you don’t need the kernels to do
that. What I’ve determined so far is that, on my personal computer
(which has an Intel Core i9-12900KS) the above function sorts longs at
255 megabytes per second. However if I comment out the sorting kernels:

void longsort(long *A, long n) {
  long t, p, i, j;
  switch (n) {
    case 0:
      return;
    case 1:
      return;
    /* case 2: */
    /*   return Sort2(A, A + 1); */
    /* case 3: */
    /*   return Sort3(A, A + 1, A + 2); */
    /* case 4: */
    /*   return Sort4(A, A + 1, A + 2, A + 3); */
    /* case 5: */
    /*   return Sort5(A, A + 1, A + 2, A + 3, A + 4); */
    default:
      if (n <= 32) {
        InsertionSort(A, n);
      } else {
        for (p = A[n >> 1], i = 0, j = n - 1;; i++, j--) {
          while (A[i] < p) i++;
          while (A[j] > p) j--;
          if (i >= j) break;
          t = A[i];
          A[i] = A[j];
          A[j] = t;
        }
        LongSort(A, i);
        LongSort(A + i, n - i);
      }
      break;
  }
}

Then my longsort() function goes 275 megabytes per second.
I achieved 7% better performance by dumbing down the algorithm. I say
let Intel do the damage. This is the function cosmopolitan libc uses to
sort elf symbol tables when loading your executables. The nice thing
about long is it’s long enough to store an int
key value pair. Being able to sort map entries fast is useful trick to
have. Blending naive quicksort with naive insertion sort is the best
solution I’ve found so far, in terms of striking a balance between
tininess and performance that’s just right for my use case. The above
function compiles down to just 181 bytes of x86-64 machine code. Since
DeepMind’s sort3() is only 42 bytes, I was hoping I could
trade away some size to gain a performance advantage. Since the next
best algorithm I’ve found so far would be switching to radix sort, which
goes 400 MB/s but needs a whopping 763 bytes of binary footprint, in
addition to depending on malloc(). So it would have been
nice to see these kernels do better.

That’s not to imply DeepMind’s ideas don’t have merit. I think it’s
important to note that DeepMind was generous enough to give us their
vectorized
quicksort
library last year (back when they were called Google
Brain) and in doing so achieved a sorting supremacy that can never be
challenged. Vqsort literally sorts longs at 1155 MB/s on my computer. It
even marginally outperforms
djbsort, which is one of the
most beloved libraries in the open source community even though it never
generalized to more data types than int. The way both
implementations pulled it off is by vectorizing sorting networks. I
think that’s where the sorting network technique truly shines. I imagine
AlphaDev would have done that if it weren’t still a toddler as far as
intelligent entities go. When you’re starting from first principles, the
baseline instruction set alone is enormously difficult to support. If we
wait, then I think we can expect to see great things from AlphaDev in
the future, as it works its way up to more formidable challenges.

I also just love the fact that DeepMind is making algorithms tinier,
since that’s something I don’t see very often. Size coding is one of my
favorite hobbies. On this blog I’ve published
a 383 byte virtual machine for
lambda calculus
and
a lisp machine with garbage
collection in 436 bytes
. I’ve also blogged about
the size optimization
tricks
I’ve used for
the cosmpolitan c
library
. I love DeepMind’s parent company too, since Google
awarded
me an open source peer bonus
a few weeks ago, and it’s nice to see
them sharing my passion for making software smaller. It’d be great to
see them using it to improve vectorized quicksort. I would love nothing
more than to have the world’s best longsort not be a C++ monster that
adds 24kB of binary footprint. It’s 23,000 lines of assembly for
ascending long sort alone. Those are the lines of code I can’t wait to
see AlphaDev ravage someday.

Finally, I like the idea of an artificial intelligence company building
machines that write code in machine language. Why wouldn’t they? It’s in
a machine’s nature to be a machine. As a builder I find that much less
triggering than the future OpenAI is creating, where they’ve built a
great big paternalistic machine that competes with every builder on
earth in a zero-sum economy, and then enticing the rent-seekers of the
world to take control of that machine through government regulation. I
don’t think it’s progress that OpenAI is promising to automate all the
tasks I love doing most, like coding. What I want is to be able to
control a machine that’s able to do the things I’m not able to do on my
own, like discovering sorting kernels. That’s the real kind of progress
that’s going to help humanity enrich itself, and I view every line of
assembly we can shave away as a step in a positive direction towards
that dream.

Read More