This is a traditional Quicksort
implementation which for the most part follows
Robert Sedgewick’s 1978 paper.
It is implemented as a C macro, which means that comparisons can be inlined.
A distinctive feature of this implementation is that it works entirely on array
indices, while actual access to the array elements is abstracted out with
the less and swap primitives provided by the caller. Here is an example
of how to sort an array of integers:

#include "qsort.h"
void isort(int A[], size_t n)
{
    int tmp;
#define LESS(i, j) A[i] < A[j]
#define SWAP(i, j) tmp = A[i], A[i] = A[j], A[j] = tmp
    QSORT(n, LESS, SWAP);
}

Since access to the actual array is so completely abstracted out,
the macro can be used to sort a few dependent arrays (which,
to the best of my knowledge, no other implementation can do):

#include "qsort.h"
int sortByAge(size_t n, const char *names[], int ages[])
{
    const char *tmpName;
    int tmpAge;
#define LESS(i, j) ages[i] < ages[j]
#define SWAP(i, j) tmpName  = names[i], tmpAge  = ages[i], 
                   names[i] = names[j], ages[i] = ages[j], 
                   names[j] = tmpName,  ages[j] = tmpAge
    QSORT(n, LESS, SWAP);
}

The sort is not stable
(this is inherent to most of Quicksort variants). To impose order among
the names with the same age, the LESS macro can be enhanced like this:

#define LESS(i, j) ages[i] <  ages[j] || 
                  (ages[i] == ages[j] && strcmp(names[i], names[j]) < 0)

This Quicksort implementation is written by Alexey Tourbin.
The source code is provided under the
MIT License.

Performance

A benchmark is provided which evaluates the performance
of a few implementations: libc’s qsort(3), STL’s std::sort (denoted
resp. stdlib and stl), Michael Tokarev’s
Inline QSORT() implementation,
and this implementation (denoted resp. mjt and svpv).
Michael Tokarev’s implementation is based on an older glibc’s version
of Quicksort. Modern glibc versions, including the one used below,
use merge sort.

A word of warning: this benchmark does only a tiny bit of averaging.
For conclusive evidence, the program needs to be run multiple times.

By default, the bench program sorts 1M random integers.

$ make
g++ -O2 -g -Wall -fwhole-program -o bench bench.cc
./bench
stdlib     402584990    19644762
stl        230878632    25344013
mjt        272302466    24316349
svpv       245908342    23287211

The STL implementation turns out to be the fastest (the first column
indicates the number of
RDTSC cycles),
despite the fact that it performs the largest number of comparisons
(the second column).

One reason my implementation comes in second to STL is some if its design
limitations. The swap macro issues three moves as a whole, while some
parts of the algorithm, notably
insertion sort,
can benefit from copying items to the right one position rather than doing
full exchanges. (It wouldn’t be enough to factor swap into save,
restore, and copy, though. After an item is saved to the temporary
register, it is further required to compare other items to that temporary
register, which less can’t do.)

Of course, to a considerable degree, performance depends on the compiler
being used. I found that my implementation is favoured by Clang, which
also disrespects std::sort (using -O3 doesn’t help). Sedgewick was
right when he said we exposed ourselves to the whims of compilers.

$ rm -f bench && make CXX=clang
clang -O2 -g -Wall -fwhole-program -o bench bench.cc
clang: warning: optimization flag '-fwhole-program' is not supported
./bench
stdlib     414620784    19644762
stl        321896126    25344013
mjt        270644434    24316349
svpv       233669286    23287211

Relevant to performance is another characteristic of my implementation:
it does not assume that comparisons are cheap, as with integers, and
deliberately tries to reduce the number of comparisons when it is easily
possible (specifically, during insertion sort, it does not trade boundary
checks for extra comparisons). This pays off when comparisons are
expensive, such as when comparing string keys with strcmp(3).
In the following example, I use filenames and dependencies from the RPM
database as the set of strings to be sorted, shuffling them with shuf(1).

$ rpm -qa --list --requires --provides | shuf >lines
$ wc -l 

My implementation comes in first now, and std::sort falls behind by
a big margin! Note that qsort(3) makes even fewer comparisons (this is
inherent to merge sort as opposed to any Quicksort variant), which makes
it come in second. Again, Clang favours my implementation even more.

In the above examples, GCC 6.3.1 and Clang 3.8.0 have been used
on a Haswell CPU.

Read More