As a programmer, you use hash functions every day. They’re used in databases
to optimise queries, they’re used in data structures to make things faster,
they’re used in security to keep data safe. Almost every interaction you have
with technology will involve hash functions in one way or another.

Hash functions are foundational, and they are everywhere.

But what is a hash function, and how do they work?

In this post, we’re going to demystify hash functions. We’re going to start by
looking at a simple hash function, then we’re going to learn how to test if a
hash function is good or not, and then we’re going to look at a real-world use
of hash functions: the hash map.

This article has visualisations that can be clicked.

#
What is a hash function?

Hash functions are functions that take an input, usually a string, and produce a
number. If you were to call a hash function multiple times with the same input,
it will always return the same number, and that number returned will always be
within a promised range. What that range is will depend on the hash function,
some use 32-bit integers (so 0 to 4 billion), others go much larger.

If we were to write a dummy hash function in JavaScript, it might look like
this:

function hash(input) {
  return 0;
}

Even without knowing how hash functions are used, it’s probably no surprise
that this hash function is useless. Let’s see how we can measure how good a
hash function is, and after that we’ll do a deep dive on how they’re used
within hash maps.

#
What makes a hash function good?

Because input can be any string, but the number returned is within some
promised range, it’s possible that two different inputs can return the same
number. This is called a “collision,” and good hash functions try to minimise
how many collisions they produce.

It’s not possible to completely eliminate collisions, though. If we wrote a hash
function that returned a number in the range 0 to 7, and we gave it 9 unique
inputs, we’re guaranteed at least 1 collision.

hash("to") == 3
hash("the") == 2
hash("café") == 0
hash("de") == 6
hash("versailles") == 4
hash("for") == 5
hash("coffee") == 0
hash("we") == 7
hash("had") == 1

Output values from a well-known hash function, modulo 8. No matter what 9
values we pass, there are only 8 unique numbers and so collisions are
inevitable. The goal is to have as few as possible.

To visualise collisions, I’m going to use a grid. Each square of the grid is
going to represent a number output by a hash function. Here’s an example 8×2
grid. Click on the grid to increment the
example hash output value and see how we map it to a grid square. See what
happens when you get a number larger than the number of grid squares.


13 % 16 == 13

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Every time we hash a value, we’re going to make its corresponding square on the
grid a bit darker. The idea is to create an easy way to see how well a hash
function avoids collisions. What we’re looking for is a nice, even distribution.
We’ll know that the hash function isn’t good if we have clumps or patterns of
dark squares.

You said that when a hash function outputs the same value for 2 different
inputs, that’s a collision. But if we have a hash function that outputs values
in a big range, and we mapped those to a small grid, aren’t we going to create
lots of collisions on the grid that aren’t actually collisions? On our 8×2
grid, 1 and 17 both map to the 2nd square.

This is a great observation. You’re absolutely right, we’re going to be creating
“pseudo-collisions” on our grid. It’s okay, though, because if the hash function
is good we will still see an even distribution. Incrementing every square by 100
is just as good a distribution as incrementing every square by 1. If we have a
bad hash function that collides a lot, that will still stand out. We’ll see
this shortly.

Let’s take a larger grid and hash 1,000 randomly-generated strings. You can
click on the grid to hash a new set of random
inputs, and the grid will animate to show you each input being hashed and placed
on the grid.

The values are nice and evenly distributed because we’re using a good,
well-known hash function called murmur3. This hash
is widely used in the real-world because it has great distribution while also
being really, really fast.

What would our grid look like if we used a bad hash function?

function hash(input) {
  let hash = 0;
  for (let c of input) {
    hash += c.charCodeAt(0);
  }
  return hash % 1000000;
}

This hash function loops through the string that we’re given and sums the
numeric values of each character. It then makes sure that the value is between 0
and 1000000 by using the modulus operator (%). Let’s call this hash function
stringSum.

Here it is on the grid. Reminder, this is 1,000 randomly generated strings that
we’re hashing.

This doesn’t look all that different from murmur3.
What gives?

The problem is that the strings we’re giving to be hashed are random. Let’s see
how each function performs when given input that is not random: the numbers from
1 to 1000 converted to strings.

Now the problem is more clear. When the input isn’t random, the output of stringSum forms a pattern. Our murmur3 grid, however, looks the same as how it looked with
random values.

How about if we hash the top 1,000 most common English words:

It’s more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3
looks the same as it always does.

This is the power of a good hash function: no matter the input,
the output is evenly distributed. Let’s talk about one more way to visualise
this and then talk about why it matters.

#
The avalanche effect

Another way hash functions get evaluated is on something called the “avalanche
effect.” This refers to how many bits in the output value change when just a
single bit of the input changes. To say that a hash function has a good
avalanche effect, a single bit flip in the input should result in an average of
50% the output bits flipping.

It’s this property that helps hash functions avoid forming patterns in the grid.
If small changes in the input result in small changes in the output, you get
patterns. Patterns indicate poor distribution, and a higher rate of collisions.

Below, we are visualising the avalanche effect by showing two 8-bit binary
numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit
in the input. Bits that change in the output will be green, bits that stay the same will be red.

murmur3 does well, though you will notice that
sometimes fewer than 50% of the bits flip and sometimes more. This is okay,
provided that it is 50% on average.

Let’s see how stringSum performs.

Well this is embarassing. The output is equal to the input, and so only a single
bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the
string. This example only hashes the equivalent of a single character, which
means the output will always be the same as the input.

#
Why all of this matters

We’ve taken the time to understand some of the ways to determine if a hash
function is good, but we’ve not spent any time talking about why it matters.
Let’s fix that by talking about hash maps.

To understand hash maps, we first must understand what a map is. A map is a data
structure that allows you to store key-value pairs. Here’s an example in
JavaScript:

let map = new Map();
map.set("hello", "world");
console.log(map.get("hello"));

Here we take a key-value pair ("hello""world") and store it in the map.
Then we print out the value associated with the key "hello", which will be
"world".

A more fun real-world use-case would be to find anagrams. An anagram is when two
different words contain the same letters, for example “antlers” and “rentals”
or “article” and “recital.” If you have a list of words and you want to find
all of the anagrams, you can sort the letters in each word alphabetically and
use that as a key in a map.

let words = [
  "antlers",
  "rentals",
  "sternal",
  "article",
  "recital",
  "flamboyant",
]

let map = new Map();

for (let word of words) {
  let key = word
    .split('')
    .sort()
    .join('');

  if (!map.has(key)) {
    map.set(key, []);
  }
  map.get(key).push(word);
}

This code results in a map with the following structure:

{
  "aelnrst": [
    "antlers",
    "rentals",
    "sternal"
  ],
  "aceilrt": [
    "article",
    "recital"
  ],
  "aabflmnoty": [
    "flamboyant"
  ]
}

#
Implementing our own simple hash map

Hash maps are one of many map implementations, and there are many ways to
implement hash maps. The simplest way, and the way we’re going to demonstrate,
is to use a list of lists. The inner lists are often referred to as “buckets” in
the real-world, so that’s what we’ll call them here. A hash function is used on
the key to determine which bucket to store the key-value pair in, then the
key-value pair is added to that bucket.

Let’s walk through a simple hash map implementation in JavaScript. We’re going
to go through it bottom-up, so we’ll see some utility methods before getting to
the set and get implementations.

class HashMap {
  constructor() {
    this.bs = [[], [], []];
  }
}

We start off by creating a HashMap class with a constructor that sets up 3
buckets. We use 3 buckets and the short variable name bs so that this code
displays nicely on devices with smaller screens. In reality, you could have
however many buckets you want (and better variable names).

class HashMap {
  // ...
  bucket(key) {
    let h = murmur3(key);
    return this.bs[
      h % this.bs.length
    ];
  }
}

The bucket method uses murmur3 on the key
passed in to find a bucket to use. This is the only place in our hash map code
that a hash function is used.

class HashMap {
  // ...
  entry(bucket, key) {
    for (let e of bucket) {
      if (e.key === key) {
        return e;
      }
    }
    return null;
  }
}

The entry method takes a bucket and a key and scans the bucket until it
finds an entry with the given key. If no entry is found, null is returned.

class HashMap {
  // ...
  set(key, value) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      e.value = value;
      return;
    }
    b.push({ key, value });
  }
}

The set method is the first one we should recognise from our earlier
JavaScript Map examples. It takes a key-value pair and stores it in our hash
map. It does this by using the bucket and entry methods we created earlier.
If an entry is found, its value is overwritten. If no entry is found, the
key-value pair is added to the map. In JavaScript, { key, value } is
shorthand for { key: key, value: value }.

class HashMap {
  // ...
  get(key) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      return e.value;
    }
    return null;
  }
}

The get method is very similar to set. It uses bucket and entry to find
the entry related to the key passed in, just like set does. If an entry is
found, its value is returned. If one isn’t found, null is returned.

That was quite a lot of code. What you should take away from it is that our
hash map is a list of lists, and a hash function is used to know which of the
lists to store and retrieve a given key from.

Here’s a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our
set method. To keep the visualisation simple, if a bucket were to “overflow”,
the buckets are all reset.

Because we’re using murmur3 as our hash
function, you should see good distribution between the buckets. It’s expected
you’ll see some imbalance, but it should generally be quite even.

To get a value out of the hash map, we first hash the key to figure out which
bucket the value will be in. Then we have to compare the key we’re searching for
against all of the keys in the bucket.

It’s this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash
function, the faster we find the right bucket to search, the faster our hash
map is overall.

This is also why reducing collisions is so crucial. If we did decide to use that
dummy hash function from all the way at the start of this article, the one that
returns 0 all the time, we’ll put all of our key-value pairs into the first
bucket. Finding anything could mean we have to check all of the values in the
hash map. With a good hash function, with good distribution, we reduce the
amount of searching we have to do to 1/N, where N is the number of buckets.

Let’s see how stringSum does.

Interestingly, stringSum seems to distribute
values quite well. You notice a pattern, but the overall distribution looks
good.

Finally! A win for stringSum. I knew it would
be good for something.

Not so fast, Haskie. We need to talk about a serious problem. The distribution
looks okay on these sequential numbers, but we’ve seen that stringSum doesn’t have a good avalanche effect. This doesn’t end
well.

#
Real-world collisions

Let’s look at 2 real-world data sets: IP addresses and English words. What I’m
going to do is take 100,000,000 random IP addresses and 466,550 English
words
, hash all of them with both murmur3
and stringSum, and see how many collisions we get.

IP Addresses

murmur3 stringSum
Collisions 1,156,959 99,999,566
1.157% 99.999%

English words

murmur3 stringSum
Collisions 25 464,220
0.005% 99.5%

When we use hash maps for real, we aren’t usually storing random values in them.
We can imagine counting the number of times we’ve seen an IP address in rate
limiting code for a server. Or code that counts the occurrences of words in
books throughout history to track their origin and popularity. stringSum sucks for these applications because of it’s extremely
high collision rate.

#
Manufactured collisions

Now it’s murmur3‘s turn for some bad news.
It’s not just collisions caused by similarity in the input we have to worry
about. Check this out.

What’s happening here? Why do all of these jibberish strings hash to the same
number?

I hashed 141 trillion random strings to find values that hash to the number
1228476406 when using murmur3. Hash functions
have to always return the same output for a specific input, so it’s possible to
find collisions by brute force.

I’m sorry, 141 trillion? Like… 141 and then 12 zeroes?

Yes, and it only took me 25 minutes. Computers are fast.

Bad actors having easy access to collisions can be devastating if your software
builds hash maps out of user input. Take HTTP headers, for example. An HTTP
request looks like this:

GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: google.com

You don’t have to understand all of the words, just that the first line is the
path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us
from passing any headers we want, so we can be really mean and pass headers we
know will cause collisions. This can significantly slow down the server.

This isn’t
theoretical, either
.
If you search “HashDoS” you’ll find a lot more examples of this. It was a really
big deal in the mid-2000s.

There are a few ways to mitigate this specific to HTTP servers: ignoring
jibberish header keys and limiting the number of headers you store, for
example. But modern hash functions like murmur3
offer a more generalised solution: randomisation.

Earlier in this post we showed some examples of hash function implementations.
Those implementations took a single argument: input. Lots of modern hash
functions take a 2nd parameter: seed (sometimes called salt). In the case
of murmur3, this seed is a number.

So far, we’ve been using 0 as the seed. Let’s see what happens with the
collisions I’ve collected when we use a seed of 1.

Just like that, 0 to 1, the collisions are gone. This is the purpose of the
seed: it randomises the output of the hash function in an unpredictable way.
How it achieves this is beyond the scope of this article, all hash functions do
this in their own way.

The hash function still returns the same output for the same input, it’s just
that the input is a combination of input and seed. Things that collide with
one seed shouldn’t collide when using another. Programming languages often
generate a random number to use as the seed when the process starts, so that
every time you run your program the seed is different. As a bad guy, not knowing
the seed, it is now impossible for me to reliably cause harm.

If you look closely in the above visualisation and the one before it, they’re
the same values being hashed but they produce different hash values. The
implication of this is that if you hash a value with one seed, and want to be
able to compare against it in the future, you need to make sure you use the same
seed.

Having different values for different seeds doesn’t affect the hash map
use-case, because hash maps only live for the duration the program is running.
Provided you use the same seed for the lifetime of the program, your hash maps
will continue to work just fine. If you ever store hash values outside of your
program, in a file for example, you need to be careful you know what seed has
been used.

#
Playground

As is tradition, I’ve made a playground for you to write your own hash functions
and see them visualised with the grids seen in this article. Click
here to try it!

#
Conclusion

We’ve covered what a hash function is, some ways to measure how good it is,
what happens when it’s not good, and some of the ways they can be broken by
bad actors.

The universe of hash functions is a large one, and we’ve really only scratched
the surface in this post. We haven’t spoken about cryptographic vs
non-cryptographic hashing, we’ve touched on only 1 of the thousands of use-cases
for hash functions, and we haven’t talked about how exactly modern hash
functions actually work.

Some further reading I recommend if you’re really enthusiastic about this topic
and want to learn more:

  • https://github.com/rurban/smhasher this repository is the gold standard
    for testing how good hash functions are. They run a tonne of tests against
    a wide number of hash functions and present the results in a big table. It
    will be difficult to understand what all of the tests are for, but this is
    where the state of the art of hash testing lives.
  • https://djhworld.github.io/hyperloglog/ this is an interactive piece
    on a data structure called HyperLogLog. It’s used to efficiently count the
    number of unique elements in very, very large sets. It uses
    hashing to do it in a really clever way.
  • https://www.gnu.org/software/gperf/ is a piece of software that, when given
    the expected set of things you want to hash, can generate a “perfect” hash
    function automatically.

Feel free to join the discussion on Hacker News!

#
Acknowledgements

Thanks to everyone who read early drafts and provided invaluable feedback.

And everyone who helped me find murmur3 hash
collisions:

#
Patreon

After the success of Load Balancing and Memory
Allocation
, I have decided to set up a Patreon page:
https://patreon.com/samwho. For all of these articles going forward, I am
going to post a Patreon-exclusive behind-the-scenes post talking about
decisions, difficulties, and lessons learned from each post. It will give you
a deep look in to how these articles evolve, and I’m really stoked about
the one I’ve written for this one.

If you enjoy my writing, and want to support it going forward, I’d really
appreciate you becoming a Patreon. ❤️

Read More