July 24, 2023

About which one cannot speak, one must pass over in silence. –Wittgenstein

Do you see the off-by-one error in this formula?

[
textrm{Attention}(Q, K, V) = textrm{softmax}left(frac{QK^T}{sqrt{d}}right)V
]

The attention formula is the central equation of modern AI,
but there’s a bug in it that has been driving me nuts the last week. I tried
writing a serious-looking research paper about the bug and my proposed fix, but
I lost a series of pitched battles against Pytorch and `biblatex`,
so I figured I’d just write a blog post instead. (History is written by the
winners; blogs are written by…)

In this post I’ll explain how (IMHO) the current generation of AI models have
an off-by-one error in a crucial place, and it’s making everyone’s Transformer
models needlessly difficult to compress and deploy. Consider this an opinion
piece, but if anyone out there on the Internet wants to run some experiments and
prove me right, we can collaborate and call ourselves scientists.

First let’s talk about why this off-by-one error matters. ChatGPT works
I first caught the scent of something
amiss whilst minding my business and perusing research papers on quantization,
the technique by which LLM Edgers cram
big models onto Mac Minis and Rasberry Pis and
jailbroken home thermostats.
Everybody in AI is limited by
RAM, so the less RAM you use the more cool stuff you can do, both Up
There In The Cloud
and down here on the edge. LLMs have billions of weights,
and if we can make these weights tighten their haunches and suck in their
paunches, we can compose better sonnets or plagiarize superior essays or
accelerate the end of days, whatever your personal motivation for using
language may be.

RAM stores information. This sounds like a tautology, but hang with me.
Information is negative log-probability,
and is how many bits we need to store things. If a stream of numbers is highly
predictable, for example is always contained in a limited range, we need fewer
bits to store them. If a stream of numbers is not predictable, like once in a
blue moon a mega-number shows up, we need more binary digits to encode the
Colossus.

This is what’s been happening in LLMs – for reasons that are only partially
understood, Transformer models contain these outlier weights and are
emitting Black Swan mega-activations that are much, much, much larger, like
orders of magnitude larger, than their peers. But no one can get rid of them;
the megalodons seem to be critical to the operation of these models, and their
existence is contrary to everything we thought we knew about neural networks
prior to building ones that worked so well. Many papers have been written about
these outlier values, and people have been cooking up all kinds of bit-burning
schemes to encode them with fewer ones and zeroes, because right now we get
pretty gnarly performance degradation with vanilla scale-and-bias integer
quantization. Georgi can tell you more; I’m getting ahead of myself.

The best analysis of all of this comes from a team at Qualcomm AI Research in this paper:
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. The authors traced the existence of these outlier values to the
attention mechanism’s
softmax function,
the seemingly innocent exponentiator that no one thought capable of such
kurtotic barbarities. The researchers came this close to finding the
off-by-one error, like killer-in-the-closet close, but they must all be on
summer vacation in Italy as none of them are responding to my email overtures, and so I
must appeal to the international community of scholars the old-fashioned way.

If you read the linked paper, just ignore their proposals. Sounds harsh, but
hear me out. The clipped softmax comes with a wheel-spinning zero gradient, and
their gated attention proposal, while workable, introduces millions of new
parameters to solve what is really just a failure to increment. There’s a
simple and hindsight-obvious solution here that, from all of my reading, no one
has thought to try.

All right, I’ve talked up this silly error enough. I’ve alluded to its
softmax function
and why it’s
not-quite-the-right tool for the job when it comes to attention.

## The Trouble With Softmax

Now to explain the bug you really have to understand what the attention
mechanism is trying to do. Most numerical bugs are programmers implementing
equations incorrectly. However, when you’re dealing not with bad code, but with
bad math, you need to understand where that equation comes from and what it’s
supposed to be doing before you have any hope of fixing it.

I had to read like 50 arXiV papers to understand all this, and probably I
should have taken a Udemy course or binge-watched Andrej Karpathy’s YouTube
let’s start with what is called the input embedding, which is a floating-point
vector that represents a word in the input string.

This vector seems to get taller every model year, for example the recent LLaMA
2 model from Meta
uses an embedding vector of length 3,204, which works out to
6KB+ in half-precision floating-point, just to represent one word in
the vocabulary, which typically contains 30,000 – 50,000 entries.

Now if you’re a memory-miserly C programmer like me, you might wonder, why in
the world are these AI goobers using 6KB to represent something that ought to
take, like 2 bytes tops? If their vocabulary is less than (2^{16})=65,384, we
only need 16 bits to represent an entry, yeah?

Well, here is what the Transformer is actually doing: it transforms
(eh?) that input vector to an output vector of the same size, and that
final 6KB output vector needs to encode absolutely everything needed to
predict the token after the current one
. The job of each layer of the
Transformer is quite literally adding information to the original, single-word
vector. This is where the residual (née skip) connections come in: all of the
attention machinery is just adding supplementary material to that original two
bytes’ worth of information, analyzing the larger context to indicate, for
instance, that the word pupil is referring to a student, and not to the
hole in your eye. Repeat the attention mechanism a few dozen times and you’ve mastered
the English language and all its vasty contents.

Now the final step of Transformer is to multiply this output vector by a
rectangular matrix, and cram the resulting vocabulary-length vector into a
softmax, treating those exponentiated outputs as next-token probabilities. This
is reasonable, but everyone knows it’s not quite right, as no sane and
respected model out there regards those output probabilities as correct, but
instead every implementation and its sister uses a sampling mechanism to hide
the fact that softmax is over-representing low probabilities.

That’s all fine and workable; softmax in the output step gives us a gradient
for every word in the vocabulary, and it’s a reasonable choice until something
better comes along.

But what I want to argue is that sauce for the goose shouldn’t be slathered on
the gander; the Transformer’s output softmax serves a very different
purpose from the attention mechanism’s internal softmax, and we’d all
do well to get rid of the latter, or at least prop up its denominator with something
handy ⛱️.

### An exponential peg…

So what is softmax? It started out in statistical mechanics as a way to
predict the distribution of states based on their energy levels:

[
p_i propto expleft(-frac{varepsilon_i}{kT}right)
]

Then the economists got a hold of it and realized that if the noise term in
people’s otherwise linear utility functions happened to follow a Gumbel
distribution
(doesn’t yours?), then the probability someone will choose an
item will be proportional to the exponent of the utility inputs:

[
Pr(Y_i=k)=frac{exp(X_i beta_k)}{sum_j exp(X_i beta_j)}
]

This gave softmax a life in multinomial logistic functions; this is where I got
to know the old chap, as in I hand-derived the Hessian of this sucker in grad
school and coded it up to run on GPUs using linear fixed effects, which to my
knowledge no one else has been foolish enough to attempt before or since. I
just mention this because I know the softmax function better than I know many
of my humanoid friends, and I can recognize when it’s being used for things it
oughtn’t.

Softmax is a kind of cheat code to map real-valued numbers to probabilities
that sum to one. In physics, it works quite well; in economics, it’s a little
bit of a lie, but by the time it got to machine learning, it just became a
thing that seemed to work whenever a discrete choice was involved. Softmax
the vector and pick something, all right?

[
(textrm{softmax(x)})_i =frac{exp(x_i)}{sum_j exp(x_j)}
]

Here’s the core mechanic of softmax: it forces a choice among competing
alternatives
, whether it’s particles picking an energy state or consumers
choosing a car. That is, if a softmax mechanism doesn’t want to choose
anything at all
, softmax will require modification, or else we would
expect the softmax to produce distortions once it encounters actual data.

In the context of LLMs, one of those distortions is to heavily weight
non-semantic tokens (commas and such), and those beefy weights become those
difficult-to-compress outlier values that make Life on the Edge harder
than it should be. The Qualcomm AI researchers found that 97%+ of outlier
activations in LLMs occur in whitespace and punctuation positions.

Something’s fishy around here… and I thought I ordered the chicken.

### …in a linear hole

Let’s dive in to how softmax is used inside of attention and see exactly where
it goes wrong. Here’s the equation again:

[
textrm{Attention}(Q, K, V) = textrm{softmax}left(frac{QK^T}{sqrt{d}}right)V
]

Breaking it down: in decoder-only models (i.e., everything since ChatGPT), (Q),
(K), and (V) all originate from the same input sequence. They’re not identical,
as they’ve been projected in different ways along the way, but in each layer

Now: (QK^T) is looking for correlations between token (embedding) vectors in
different positions, essentially building up a square matrix of correlation
(dot-product scaled by (1/sqrt{d})) values, where each column and row corresponds to a token
position. Then each row of this square matrix is softmaxed, with the
resulting probabilities used as a mixing function for the value vectors in the
(V) matrix. The probability-mixed (V) gets added to the input vector, with the
sum passed on down the neural network for further processing.

Multi-head attention goes through this process several times, in parallel, per
layer. Essentially it divides up the embedding vector, and each head uses
information in the entire vector to annotate one (non-overlapping) segment of
the output vector. If you were confused by the Concatenation operation in the
information to Segment 1, Head 2 adds information to Segment 2, and so on.

The problem with using softmax is that it forces each attention head to
make an annotation, even if it has no information to add to the output
vector.
Using softmax to choose among discrete alternatives is great;
using it for optional annotation (i.e. as input into addition) is, like,
not cool, man. The problem here is exacerbated with multi-head attention, as a
specialized head is more likely to want to “pass” than a general-purpose one.
These attention heads are needlessly noisy, a deafening democracy where
abstention is disallowed.

Now it’s possible that softmax should be replaced wholesale, but it’s worked
pretty well for the most part, except for this one wee little bug that prevents
attention heads from saying nothing. So I propose a very small tweak on which I
am willing to stake all future Internet claims to being correct. The tweak
is so small, yet so obvious, and it’s been sitting here under everyone’s noses
ever since attention was invented (2014).

I can’t hear you.

## Softmax One and Quiet Attention

All right, here it is and you are welcome, the long-awaited Softmax Super-Mod that
is soon to set the LLM hacker channels aflame:

[
(textrm{softmax}_1(x))_i = frac{exp(x_i)}{1+sum_j exp(x_j)}
]

Bit of a let-down, eh? All I did was added one to the denominator. This lets
the vector as a whole tend to zero if it wants, but otherwise just shrinks the
values by a small amount, a shrinkage which will be made up for during
normalization, which happens right after attention.

The key difference is in the negative limit, when entries in (x) are significantly
less than zero and the model is trying to avoid an annotation altogether.
Compare the limiting behavior of the original softmax

[
lim_{x_1 to -infty} ldots
lim_{x_k to -infty}
(textrm{softmax}(x))_i = frac{1}{k} gt 0
]

to that of the new and improved softmax1:

[
lim_{x_1 to -infty} ldots
lim_{x_k to -infty}
(textrm{softmax}_1(x))_i = 0
]

Vanilla softmax will always emit the same total weight; softmax1
mostly looks the same, but comes with an escape hatch in the negative orthant.
To be clear: the core issue here is mathematical and not
numerical in nature. Extra precision won’t save the softmax; all
Transformers are affected.

You’ll notice a couple of other things about softmax1. The
derivative is positive, so we always have a non-zero gradient, and its sum is
between zero and one, so the output isn’t out of control. The function maintains the property

[
frac{(textrm{softmax}_1(x))_i}{(textrm{softmax}_1(x))_j}
= frac{(textrm{softmax}(x))_i}{(textrm{softmax}(x))_j}
= frac{exp(x_i)}{exp(x_j)} quad forall i, j
]

i.e. relative values in the output vector are unchanged.

Originally I wanted to call this
function ghostmax, as you can think of there being an extra
zero-valued entry in (x) (as (exp(0)=1)), as well as a zero vector in the
(V) matrix that attenuates the result. But I didn’t want to scare anyone.

Even though softmax1 is facially quite boring, I’m 99.44% sure that it will
resolve the outlier feedback loop that’s making quantization the subject of
cascades of research. If you want to run some experiments and prove me right,
DM me on Twitter and we’ll get a
paper going.

We can call the improved mechanism QuietAttention, as it allows
attention heads to keep their yaps shut. So may I propose to a receptive public

[
textrm{QuietAttention}(Q, K, V) := textrm{softmax}_1 left(frac{QK^T}{sqrt{d}}right)V
]

I think a test could be hacked together fairly quickly: if you prefix every
input context with a zero vector, and ensure that your neural network of choice
doesn’t add any bias (including with the positional encoding), then the zero
should pass through unaltered and will have the effect of adding unity to every
subsequent softmax denominator; that way you won’t lose your mind mucking with
gradient code. I think this could be done with a LLaMA model that uses
fixed embeddings and a special prefix token, but
I’d very much like to get
back to my non-existent hobbies, and
on this problem than my therapist needs to know about.

You’d still have to re-train the model, so don’t try this on an RPi just yet.
But do let me know how those weight kurtoses and activation infinity norms are
looking after a few runs. I’m thinking those numbers will make for a handsome
table in a soon-to-be influential arXiV paper, either when those Qualcomm AI
researchers step off the plane from Italy, or someone in an LLM hacker channel
figures out `biblatex`, whichever happens first.

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy: