Now that we can tell which characters are tokens and which are simply whitespace, let’s step up a level and talk about scanning.

// Next returns a []byte referencing the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occurred, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
//	{ Object start
//	[ Array start
//	} Object end
//	] Array end
//	, Literal comma
//	: Literal colon
//	t JSON true
//	f JSON false
//	n JSON null
//	" A string, possibly containing backslash escaped entities.
//	-, 0-9 A number
func (s *Scanner) Next() []byte {
	s.br.release(s.offset) // release the previous token
	s.offset = 0

	c := s.token() // align to the next token
	length := 0
	switch c {
	case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
		length = 1
		s.offset = 1
	case True:
		length = validateToken(&s.br, "true")
		s.offset = length
	case False:
		length = validateToken(&s.br, "false")
		s.offset = length
	case Null:
		length = validateToken(&s.br, "null")
		s.offset = length
	case String:
		length = parseString(&s.br)
		if length < 2 {
			return nil
		}
		s.offset = length
	case 0:
		// eof
		return nil
	default:
		length = s.parseNumber()
		if length < 0 {
			return nil
		}
	}
	return s.br.window()[:length]
}

This is the core loop of Scanner.Next.
Scanner.Next skips over any intermediate whitespace, determines the type of the token from the first character in the window, then reads until the token is complete or we hit the end of the input.
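The validateToken and parseString helpers used above aren’t shown in this section. Here are minimal sketches of what they might do; these are simplified to operate on a []byte directly, with no buffer extension, whereas the real helpers read through the byteReader (so treat them as illustrations, not the actual pkg/json implementations):

```go
package main

import "fmt"

// validateToken (sketch): checks that the window begins with the
// expected literal ("true", "false" or "null") and returns the
// literal's length, or 0 on a mismatch. The real helper can extend
// the buffer when the window is too short.
func validateToken(w []byte, expected string) int {
	if len(w) < len(expected) || string(w[:len(expected)]) != expected {
		return 0
	}
	return len(expected)
}

// parseString (sketch): scans a double quoted JSON string at the
// front of w and returns the number of bytes it occupies, including
// both quotes, or 0 if the string is incomplete. A valid result is
// always at least 2, which matches the length < 2 check above.
func parseString(w []byte) int {
	if len(w) < 2 || w[0] != '"' {
		return 0
	}
	for i := 1; i < len(w); i++ {
		switch w[i] {
		case '\\':
			i++ // skip the escaped character
		case '"':
			return i + 1 // include the closing quote
		}
	}
	return 0 // unterminated string
}

func main() {
	fmt.Println(validateToken([]byte("true,1"), "true")) // 4
	fmt.Println(parseString([]byte(`"a\"b":1`)))         // 6
}
```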

Let’s look at how token works, then we’ll talk about some optimisations.

// token positions the scanner at the next token in the stream
// and returns the first byte of the token.
func (s *Scanner) token() byte {
	w := s.br.window()
	for {
		for _, c := range w {
			if whitespace[c] {
				s.offset++
				continue
			}

			// release whitespace
			s.br.release(s.offset)
			s.offset = 0
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[s.offset:]
	}
}
  • We start by getting the current window from the byteReader.
    This is a []byte slice of all the data that is yet to be read.

  • We’re looking for the first non-whitespace character.
    If the character is whitespace we increment s.offset to skip it and loop around.

  • When we do find a non-whitespace character, we release s.offset characters from the front of the window.
    Now the start of the window is properly aligned with the first character of the token.

  • It turns out we also get the first character of the token for free; it’s in c, so we can return that as a hint to Scanner.Next.

  • If we run out of characters without hitting a token, we call extend() to grow the window.

  • If we couldn’t grow, then we’ve run out of input and haven’t got a token, so give up.

  • Otherwise update w with a new window.
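The whitespace table the inner loop indexes into hasn’t been shown; it’s presumably a 256-entry lookup table along these lines (a sketch, the exact representation in the real code may differ):

```go
package main

import "fmt"

// whitespace is a 256-entry lookup table: true for the four JSON
// whitespace characters, false for everything else. Indexing by byte
// replaces a chain of comparisons with a single table load.
var whitespace = [256]bool{
	' ':  true,
	'\t': true,
	'\n': true,
	'\r': true,
}

func main() {
	for _, c := range []byte(" \t{x") {
		fmt.Printf("%q: %v\n", c, whitespace[c])
	}
}
```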

Some things to note:

  • Note the lack of error handling; it’s not part of the inner loop, and only happens when we have to read more data from the underlying reader via extend.

  • extend hides the process of reading into, growing, and refilling the buffer. This makes the caller, Scanner.token, simpler: if there is data in the window, process it; extend if you need to; if you can’t extend, give up.

  • release is similar; it shrinks the start of the window to exclude data we no longer care about.

  • extend is not in the hot path, so there is no need to optimise it, its performance is a function of the buffer it is given.
    In practice an initial buffer of 8k is sufficient.
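The byteReader itself is outside the scope of this section, but its contract is just the three operations used above: window, release, and extend. Here is a sketch backed by a fixed in-memory buffer; that backing is an assumption for illustration, the real type wraps an io.Reader and refills its buffer in extend:

```go
package main

import "fmt"

// byteReader (sketch) maintains a window over buffered input.
type byteReader struct {
	data   []byte
	offset int // start of the window within data
}

// window returns all buffered data that has not yet been released.
func (b *byteReader) window() []byte { return b.data[b.offset:] }

// release discards n bytes from the front of the window.
func (b *byteReader) release(n int) { b.offset += n }

// extend tries to buffer more data, returning the number of bytes
// added. Backed by a fixed slice there is never more to read, which
// mirrors hitting EOF on the underlying reader.
func (b *byteReader) extend() int { return 0 }

func main() {
	br := byteReader{data: []byte(`  {"a":1}`)}
	br.release(2) // drop the two leading whitespace bytes
	fmt.Printf("%s\n", br.window()) // {"a":1}
}
```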

Let’s talk about the performance of this code.

name                     time/op
Scanner/canada-16         4.41ms ± 2%
Scanner/citm_catalog-16   2.55ms ± 3%
Scanner/twitter-16        1.03ms ± 1%
Scanner/code-16           4.21ms ± 1%
Scanner/example-16        21.4µs ± 1%
Scanner/sample-16          822µs ± 1%

name                     speed
Scanner/canada-16        510MB/s ± 2%
Scanner/citm_catalog-16  677MB/s ± 3%
Scanner/twitter-16       615MB/s ± 1%
Scanner/code-16          461MB/s ± 1%
Scanner/example-16       608MB/s ± 1%
Scanner/sample-16        837MB/s ± 1%

Comparing the performance of Scanner.Next to our whitespace benchmark, we’re running at between a quarter and two fifths of our baseline.

Let’s talk about improvements we can make to the code.
Note the amount of work being spent to keep s.offset up to date.
We know that s.offset is set to 0 just before Scanner.Next calls this function, and we set s.offset to zero on the way out of the function, so the changes we make to s.offset within the function are invisible: it’s zero on entry, and zero on exit.

We can rewrite the function to keep a local offset value, which has an impressive effect on token.

func (s *Scanner) token() byte {
	w := s.br.window()
	offset := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				offset++
				continue
			}

			// release whitespace
			s.br.release(offset)
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[offset:]
	}
}
name                     old time/op    new time/op    delta
Scanner/canada-16          4.39ms ± 1%    4.43ms ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16    2.52ms ± 1%    1.80ms ± 4%  -28.46%  (p=0.008 n=5+5)
Scanner/twitter-16         1.03ms ± 2%    0.95ms ± 3%   -7.41%  (p=0.008 n=5+5)
Scanner/code-16            4.24ms ± 2%    4.18ms ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16         21.4µs ± 1%    18.9µs ± 2%  -11.68%  (p=0.008 n=5+5)
Scanner/sample-16           828µs ± 2%     528µs ± 2%  -36.24%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         512MB/s ± 1%   509MB/s ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16   685MB/s ± 1%   958MB/s ± 4%  +39.84%  (p=0.008 n=5+5)
Scanner/twitter-16        616MB/s ± 2%   665MB/s ± 3%   +8.01%  (p=0.008 n=5+5)
Scanner/code-16           458MB/s ± 2%   465MB/s ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16        608MB/s ± 1%   688MB/s ± 2%  +13.23%  (p=0.008 n=5+5)
Scanner/sample-16         831MB/s ± 2%  1303MB/s ± 2%  +56.84%  (p=0.008 n=5+5)

By keeping offset local the compiler avoided those temporary writes back to memory.
The question I have for you is why did this improve some inputs and not others?

The answer is different inputs have different amounts of whitespace.
For example canada only has 33 whitespace characters whereas citm has 1,227,563.
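Those counts are easy to reproduce; a sketch that tallies the four JSON whitespace characters in an input (reading the file itself is omitted):

```go
package main

import "fmt"

// countWhitespace reports how many bytes of buf are JSON whitespace.
// Run over inputs like canada.json vs citm_catalog.json the counts
// differ by several orders of magnitude, which is why the local-offset
// change sped up some benchmarks far more than others.
func countWhitespace(buf []byte) int {
	n := 0
	for _, c := range buf {
		switch c {
		case ' ', '\t', '\n', '\r':
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countWhitespace([]byte("{\"a\": 1,\n \"b\": 2}"))) // 4
}
```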

Quiz me after the talk why the compiler can’t optimise these writes away.

Inlining

There is a larger improvement we can make for the runtime of this code, and it relates to inlining.
Inlining is the process of automatically (or manually) copying the body of a function into its caller, in line at the call site.
This avoids the overhead of the function call.

The Go compiler has reasonable support for inlining, but has a number of limitations.

 % go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80

The first limitation is the size of the function: byteReader.extend cannot be inlined because it is too complex.

The second is certain statements within the function: Scanner.token cannot be inlined because it contains a for statement.
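The for-statement restriction is easy to demonstrate on toy functions. Compiling a file like this with go build -gcflags=-m reports the loop-free function as inlinable, while (on the compiler version used here; later releases have relaxed this) the one containing a for statement is rejected:

```go
package main

import "fmt"

// smaller is short and loop-free; the compiler reports it as
// inlinable, so callers pay no function call overhead.
func smaller(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// sum contains a for statement. On the compiler version used in this
// talk that alone disqualified it from inlining ("unhandled op FOR"),
// no matter how cheap its body is.
func sum(xs []int) int {
	t := 0
	for _, x := range xs {
		t += x
	}
	return t
}

func main() {
	fmt.Println(smaller(1, 2), sum([]int{1, 2, 3}))
}
```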

Let’s go back to the constraints:

  • Scanner.Next is called for each token in the input.

  • This means that Scanner.token is called for each token in the input.

  • Scanner.token cannot be automatically inlined into its caller because it is too complex.

  • Therefore we’re paying two function calls per token.

We can remove one of these by manually inlining Scanner.token into its caller.

func (s *Scanner) Next() []byte {
	// release the previous token
	s.br.release(s.offset)
	w := s.br.window()
	offset := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				offset++
				continue
			}

			// release whitespace
			s.br.release(offset)

			length := 0
			switch c {
			case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
				length = 1
				s.offset = 1
			case True:
				length = validateToken(&s.br, "true")
				s.offset = length
			case False:
				length = validateToken(&s.br, "false")
				s.offset = length
			case Null:
				length = validateToken(&s.br, "null")
				s.offset = length
			case String:
				// string
				length = parseString(&s.br)
				if length < 2 {
					return nil
				}
				s.offset = length
			default:
				// ensure the number is correct.
				length = s.parseNumber()
				if length < 0 {
					return nil
				}
			}
			return s.br.window()[:length]
		}
		if s.br.extend() == 0 {
			// eof
			return nil
		}
		w = s.br.window()[offset:]
	}
}

The results support our thesis:

name                     old time/op    new time/op    delta
Scanner/canada-16          4.36ms ± 1%    3.50ms ± 0%  -19.68%  (p=0.008 n=5+5)
Scanner/citm_catalog-16    1.80ms ± 1%    1.56ms ± 2%  -13.16%  (p=0.008 n=5+5)
Scanner/twitter-16          965µs ± 2%     833µs ± 2%  -13.75%  (p=0.008 n=5+5)
Scanner/code-16            4.15ms ± 1%    3.61ms ± 1%  -12.82%  (p=0.008 n=5+5)
Scanner/example-16         18.9µs ± 2%    16.6µs ± 1%  -12.42%  (p=0.008 n=5+5)
Scanner/sample-16           515µs ± 1%     472µs ± 2%   -8.34%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         516MB/s ± 1%   642MB/s ± 0%  +24.50%  (p=0.008 n=5+5)
Scanner/citm_catalog-16   960MB/s ± 1%  1105MB/s ± 2%  +15.16%  (p=0.008 n=5+5)
Scanner/twitter-16        654MB/s ± 2%   759MB/s ± 1%  +15.94%  (p=0.008 n=5+5)
Scanner/code-16           468MB/s ± 1%   537MB/s ± 1%  +14.69%  (p=0.008 n=5+5)
Scanner/example-16        689MB/s ± 2%   787MB/s ± 1%  +14.17%  (p=0.008 n=5+5)
Scanner/sample-16        1.33GB/s ± 1%  1.46GB/s ± 1%   +9.11%  (p=0.008 n=5+5)

By saving the function call we’ve improved throughput by 9-24%.
The largest improvement comes from canada, which contains almost no whitespace, so the call to Scanner.token almost always returned immediately, having done no work!

To recap the major optimisations were:

  • Avoiding s.offset updates. Writes through a pointer cannot be registerised, so the CPU had to perform a write on every iteration.
    s.offset updates dropped from one per byte to one per token.

  • Scanner.Next and Scanner.token were effectively one function spread over two.
Each is too large to be inlined, so we were paying for an extra function call per token.
    Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.

  • Most JSON contains some whitespace; it’s moderately optimised for human readability.
    It turns out the more whitespace, the faster pkg/json decodes! citm runs at over 50% of the baseline, sample at nearly 75%.

Take away: Avoiding function calls can improve performance in the hot path.

Read More