By Evan Miller

*July 24, 2023*

About which one cannot speak, one must pass over in silence. –Wittgenstein

Do you see the off-by-one error in this formula?

[

textrm{Attention}(Q, K, V) = textrm{softmax}left(frac{QK^T}{sqrt{d}}right)V

]

The attention formula is the central equation of modern AI,

but there’s a bug in it that has been driving me nuts the last week. I tried

writing a serious-looking research paper about the bug and my proposed fix, but

I lost a series of pitched battles against Pytorch and `biblatex`

,

so I figured I’d just write a blog post instead. (History is written by the

winners; blogs are written by…)

In this post I’ll explain how (IMHO) the current generation of AI models have

an off-by-one error in a crucial place, and it’s making everyone’s Transformer

models needlessly difficult to compress and deploy. Consider this an opinion

piece, but if anyone out there on the Internet wants to run some experiments and

prove me right, we can collaborate and call ourselves scientists.

##
It’s All About Outliers

First let’s talk about why this off-by-one error matters. *ChatGPT works
great, what’s your problem bro?* I first caught the scent of something

amiss whilst minding my business and perusing research papers on quantization,

the technique by which LLM Edgers cram

big models onto Mac Minis and Rasberry Pis and

jailbroken home thermostats.

Everybody in AI is limited by

RAM, so the less RAM you use the more cool stuff you can do, both Up

There In The Cloud and down here on the edge. LLMs have billions of weights,

and if we can make these weights tighten their haunches and suck in their

paunches, we can compose better sonnets or plagiarize superior essays or

accelerate the end of days, whatever your personal motivation for using

language may be.

RAM stores information. This sounds like a tautology, but hang with me.

Information is negative log-probability,

and is how many bits we need to store things. If a stream of numbers is highly

predictable, for example is always contained in a limited range, we need fewer

bits to store them. If a stream of numbers is not predictable, like once in a

blue moon a mega-number shows up, we need more binary digits to encode the

Colossus.

This is what’s been happening in LLMs – for reasons that are only partially

understood, Transformer models contain these outlier weights and are

emitting Black Swan mega-activations that are much, much, much larger, like

orders of magnitude larger, than their peers. *But no one can get rid of them*;

the megalodons seem to be critical to the operation of these models, and their

existence is contrary to everything we thought we knew about neural networks

prior to building ones that worked so well. Many papers have been written about

these outlier values, and people have been cooking up all kinds of bit-burning

schemes to encode them with fewer ones and zeroes, because right now we get

pretty gnarly performance degradation with vanilla scale-and-bias integer

quantization. Georgi can tell you more; I’m getting ahead of myself.

The best analysis of all of this comes from a team at Qualcomm AI Research in this paper:

**Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing**. The authors traced the existence of these outlier values to the

attention mechanism’s

softmax function,

the seemingly innocent exponentiator that no one thought capable of such

kurtotic barbarities. The researchers came *this close* to finding the

off-by-one error, like killer-in-the-closet close, but they must all be on

summer vacation in Italy as none of them are responding to my email overtures, and so I

must appeal to the international community of scholars the old-fashioned way.

If you read the linked paper, just ignore their proposals. Sounds harsh, but

hear me out. The clipped softmax comes with a wheel-spinning zero gradient, and

their gated attention proposal, while workable, introduces millions of new

parameters to solve what is really just a failure to increment. There’s a

simple and hindsight-obvious solution here that, from all of my reading, no one

has thought to try.

All right, I’ve talked up this silly error enough. I’ve alluded to its

location. Let’s talk about the

softmax function

and why it’s

not-quite-the-right tool for the job when it comes to attention.

##
The Trouble With Softmax

Now to explain the bug you really have to understand what the attention

mechanism is trying to do. Most numerical bugs are programmers implementing

equations incorrectly. However, when you’re dealing not with bad code, but with

bad math, you need to understand where that equation comes from and what it’s

supposed to be doing before you have any hope of fixing it.

I had to read like 50 arXiV papers to understand all this, and probably I

should have taken a Udemy course or binge-watched Andrej Karpathy’s YouTube

channel instead, but

let’s start with what is called the input embedding, which is a floating-point

vector that represents a word in the input string.

This vector seems to get taller every model year, for example the recent LLaMA

2 model from Meta uses an embedding vector of length 3,204, which works out to

6KB+ in half-precision floating-point, just to represent *one word* in

the vocabulary, which typically contains 30,000 – 50,000 entries.

Now if you’re a memory-miserly C programmer like me, you might wonder, why in

the world are these AI goobers using 6KB to represent something that ought to

take, like 2 bytes tops? If their vocabulary is less than (2^{16})=65,384, we

only need 16 bits to represent an entry, yeah?

Well, here is what the Transformer is actually doing: it *transforms*

(eh?) that input vector to an output vector *of the same size*, and that

final 6KB output vector needs to encode *absolutely everything needed to
predict the token after the current one*. The job of each layer of the

Transformer is quite literally adding information to the original, single-word

vector. This is where the residual (née skip) connections come in: all of the

attention machinery is just adding supplementary material to that original two

bytes’ worth of information, analyzing the larger context to indicate, for

instance, that the word

*pupil*is referring to a student, and not to the

hole in your eye. Repeat the attention mechanism a few dozen times and you’ve mastered

the English language and all its vasty contents.

Now the final step of Transformer is to multiply this output vector by a

rectangular matrix, and cram the resulting vocabulary-length vector into a

softmax, treating those exponentiated outputs as next-token probabilities. This

is reasonable, but everyone knows it’s not quite right, as no sane and

respected model out there regards those output probabilities as correct, but

instead every implementation and its sister uses a sampling mechanism to hide

the fact that softmax is over-representing low probabilities.

That’s all fine and workable; softmax in the output step gives us a gradient

for every word in the vocabulary, and it’s a reasonable choice until something

better comes along.

But what I want to argue is that sauce for the goose shouldn’t be slathered on

the gander; the Transformer’s *output* softmax serves a very different

purpose from the attention mechanism’s *internal* softmax, and we’d all

do well to get rid of the latter, or at least prop up its denominator with something

handy ⛱️.

### An exponential peg…

So what is softmax? It started out in statistical mechanics as a way to

predict the distribution of states based on their energy levels:

[

p_i propto expleft(-frac{varepsilon_i}{kT}right)

]

Then the economists got a hold of it and realized that if the noise term in

people’s otherwise linear utility functions happened to follow a Gumbel

distribution (doesn’t yours?), then the probability someone will choose an

item will be proportional to the exponent of the utility inputs:

[

Pr(Y_i=k)=frac{exp(X_i beta_k)}{sum_j exp(X_i beta_j)}

]

This gave softmax a life in multinomial logistic functions; this is where I got

to know the old chap, as in I hand-derived the Hessian of this sucker in grad

school and coded it up to run on GPUs using linear fixed effects, which to my

knowledge no one else has been foolish enough to attempt before or since. I

just mention this because I know the softmax function better than I know many

of my humanoid friends, and I can recognize when it’s being used for things it

oughtn’t.

Softmax is a kind of cheat code to map real-valued numbers to probabilities

that sum to one. In physics, it works quite well; in economics, it’s a little

bit of a lie, but by the time it got to machine learning, it just became a

thing that seemed to work whenever a discrete choice was involved. Softmax

the vector and pick something, all right?

[

(textrm{softmax(x)})_i =frac{exp(x_i)}{sum_j exp(x_j)}

]

Here’s the core mechanic of softmax: *it forces a choice among competing
alternatives*, whether it’s particles picking an energy state or consumers

choosing a car. That is, if a softmax mechanism

*doesn’t want to choose*

anything at all, softmax will require modification, or else we would

anything at all

expect the softmax to produce distortions once it encounters actual data.

In the context of LLMs, one of those distortions is to heavily weight

non-semantic tokens (commas and such), and those beefy weights become those

difficult-to-compress outlier values that make Life on the Edge harder

than it should be. *The Qualcomm AI researchers found that 97%+ of outlier
activations in LLMs occur in whitespace and punctuation positions.*

Something’s fishy around here… and I thought I ordered the chicken.

###
…in a linear hole

Let’s dive in to how softmax is used inside of attention and see exactly where

it goes wrong. Here’s the equation again:

[

textrm{Attention}(Q, K, V) = textrm{softmax}left(frac{QK^T}{sqrt{d}}right)V

]

Breaking it down: in decoder-only models (i.e., everything since ChatGPT), (Q),

(K), and (V) all originate from the same input sequence. They’re not identical,

as they’ve been projected in different ways along the way, but in each layer

they all start with the same annotated (added-to) embedding vector.

Now: (QK^T) is looking for correlations between token (embedding) vectors in

different positions, essentially building up a square matrix of correlation

(dot-product scaled by (1/sqrt{d})) values, where each column and row corresponds to a token

position. Then each row of this square matrix is softmaxed, with the

resulting probabilities used as a mixing function for the value vectors in the

(V) matrix. The probability-mixed (V) gets added to the input vector, with the

sum passed on down the neural network for further processing.

Multi-head attention goes through this process several times, in parallel, per

layer. Essentially it divides up the embedding vector, and each head uses

information in the entire vector to annotate one (non-overlapping) segment of

the output vector. If you were confused by the Concatenation operation in the

original Transformer paper, that’s all that’s happening: Head 1 adds

information to Segment 1, Head 2 adds information to Segment 2, and so on.

*The problem with using softmax is that it forces each attention head to
make an annotation, even if it has no information to add to the output
vector.* Using softmax to choose among discrete alternatives is great;

using it for optional annotation (i.e. as input into addition) is, like,

not cool, man. The problem here is exacerbated with multi-head attention, as a

specialized head is more likely to want to “pass” than a general-purpose one.

These attention heads are needlessly noisy, a deafening democracy where

abstention is disallowed.

Now it’s possible that softmax should be replaced wholesale, but it’s worked

pretty well for the most part, except for this one wee little bug that prevents

attention heads from saying nothing. So I propose a very small tweak on which I

am willing to stake all future Internet claims to being correct. The tweak

is so small, yet so obvious, and it’s been sitting here under everyone’s noses

ever since attention was invented (2014).

Are you ready?

Are you *ready-ready*?

I can’t hear you.

##
Softmax One and Quiet Attention

All right, here it is and you are welcome, the long-awaited Softmax Super-Mod that

is soon to set the LLM hacker channels aflame:

[

(textrm{softmax}_1(x))_i = frac{exp(x_i)}{1+sum_j exp(x_j)}

]

Bit of a let-down, eh? All I did was added one to the denominator. This lets

the vector as a whole tend to zero if it wants, but otherwise just shrinks the

values by a small amount, a shrinkage which will be made up for during

normalization, which happens right after attention.

The key difference is in the negative limit, when entries in (x) are significantly

less than zero and the model is trying to avoid an annotation altogether.

Compare the limiting behavior of the original softmax

[

lim_{x_1 to -infty} ldots

lim_{x_k to -infty}

(textrm{softmax}(x))_i = frac{1}{k} gt 0

]

to that of the new and improved softmax_{1}:

[

lim_{x_1 to -infty} ldots

lim_{x_k to -infty}

(textrm{softmax}_1(x))_i = 0

]

Vanilla softmax will always emit the same total weight; softmax_{1}

mostly looks the same, but comes with an escape hatch in the negative orthant.

To be clear: the core issue here is *mathematical* and not

*numerical* in nature. Extra precision won’t save the softmax; all

Transformers are affected.

You’ll notice a couple of other things about softmax_{1}. The

derivative is positive, so we always have a non-zero gradient, and its sum is

between zero and one, so the output isn’t out of control. The function maintains the property

[

frac{(textrm{softmax}_1(x))_i}{(textrm{softmax}_1(x))_j}

= frac{(textrm{softmax}(x))_i}{(textrm{softmax}(x))_j}

= frac{exp(x_i)}{exp(x_j)} quad forall i, j

]

i.e. relative values in the output vector are unchanged.

Originally I wanted to call this

function *ghostmax*, as you can think of there being an extra

zero-valued entry in (x) (as (exp(0)=1)), as well as a zero vector in the

(V) matrix that attenuates the result. But I didn’t want to scare anyone.

Even though softmax_{1} is facially quite boring, I’m 99.44% sure that it will

resolve the outlier feedback loop that’s making quantization the subject of

cascades of research. If you want to run some experiments and prove me right,

DM me on Twitter and we’ll get a

paper going.

We can call the improved mechanism *QuietAttention*, as it allows

attention heads to keep their yaps shut. So may I propose to a receptive public

[

textrm{QuietAttention}(Q, K, V) := textrm{softmax}_1 left(frac{QK^T}{sqrt{d}}right)V

]

I think a test could be hacked together fairly quickly: if you prefix every

input context with a zero vector, and ensure that your neural network of choice

doesn’t add any bias (including with the positional encoding), then the zero

should pass through unaltered and will have the effect of adding unity to every

subsequent softmax denominator; that way you won’t lose your mind mucking with

gradient code. I *think* this could be done with a LLaMA model that uses

fixed embeddings and a special prefix token, but

I’d very much like to get

back to my non-existent hobbies, and

I’ve already spent more time

on this problem than my therapist needs to know about.

You’d still have to re-train the model, so don’t try this on an RPi just yet.

But do let me know how those weight kurtoses and activation infinity norms are

looking after a few runs. I’m thinking those numbers will make for a handsome

table in a soon-to-be influential arXiV paper, either when those Qualcomm AI

researchers step off the plane from Italy, or someone in an LLM hacker channel

figures out `biblatex`

, whichever happens first.

*You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:*

*Get new articles as they’re published, via LinkedIn, Twitter, or RSS.*

*Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!*

**Wizard**

Statistics the Mac way

Back to Evan Miller’s home page

– Subscribe to RSS

– LinkedIn

– Twitter

## Leave A Comment