Semantic search is great for building chatbots, but indexing can be slow. In our case (bloop is a chatbot that answers questions about your code) we split code files into chunks, feed them to an embedding model and stick the resulting vectors into a vector DB. We do this on-device with a local embedding model so that code doesn’t leave the device during indexing, and to save index disk-space by using a model that generates really small vectors.

Code search is most useful on large codebases, and it turns out that some of our users were indexing very large repositories. As large as LLVM with 13 million lines of code. So we set ourselves the challenge of radically improving our index speeds by running the embedding model on the MacBook’s GPU. This turned out to be more of an undertaking than we’d anticipated, taking us on a journey through the fast-moving world of open-source AI.

On CPU we were using the MiniLM-L6-v2 embedding model which we’d converted to ONNX and quantised to 8-bits. It had several characteristics that we liked:

  1. It’s small (22MB on disk)

  2. It generates 384 dimensional vectors (smaller than most comparable models)

With it we could embed twenty 256-token sequences per second. Assuming around 18 tokens per line of code, it would take nearly 12 hours to index LLVM.
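
As a back-of-the-envelope check (the real figure depends on tokenisation and padding overhead, so treat this as a rough sketch rather than a measurement), the arithmetic looks roughly like this:

fn main() {
    let lines_of_code = 13_000_000.0_f64; // LLVM
    let tokens_per_line = 18.0;           // rough average for code
    let tokens_per_second = 20.0 * 256.0; // 20 sequences of 256 tokens per second

    let hours = lines_of_code * tokens_per_line / tokens_per_second / 3600.0;
    println!("~{hours:.1} hours to index"); // roughly half a day with these round numbers
}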

Speeding this up by sticking it on the MacBook GPU seemed straightforward. One of ONNX Runtime’s selling points is that it can take advantage of a load of different hardware accelerators, including Apple Core ML. All we had to do was tell ONNX Runtime to use Core ML and it would magically take care of the rest, running our embedding model on the GPU:

Environment::builder()
    .with_log_level(LoggingLevel::Warning)
    // register the Core ML execution provider so inference can run on Apple hardware
    .with_execution_providers([ExecutionProvider::coreml()])
    .with_telemetry(false)
    .build()?

But when we checked the GPU usage: nothing…

We took a closer look at the ONNX Runtime documentation and it turns out that not all of the operations in our MiniLM model (GeLU and QuantizeLinear, for example) are supported by the Core ML execution provider. Adding support for them ourselves didn’t look too appetising, so we cast around for alternatives.

ggml has been getting a lot of attention in the open-source AI community recently and it’s being used to run some pretty big LLMs on-device (it powers llama.cpp). It also supports 4-bit quantisation, half the lowest precision supported by ONNX, and it can now run operations on MacBook GPUs with Metal Performance Shaders. The performance is impressive.

But surprisingly we couldn’t find any projects using ggml for GPU embedding. We were going to have to do it ourselves. How hard could it be?

At this point we should note that bloop is written in Rust. Rather than interface directly with ggml (which is written in C) we used llm, an excellent crate that wraps it.

Step one was to port a ggml implementation of BERT – the transformer architecture that MiniLM uses – into llm. This went smoothly, and it wasn’t long before we had something up and running, but we quickly spotted a problem. When we inspected the vector outputs we found that they contained NaNs (not-a-number). Even worse, this behaviour was non-deterministic: for the same input sequence we’d sometimes get NaNs and sometimes not.
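
A sanity check along these lines (plain Rust, not our actual test code) was enough to flag the bad outputs:

// Minimal NaN check over an embedding vector; illustrative, not our actual test harness.
fn contains_nan(embedding: &[f32]) -> bool {
    embedding.iter().any(|v| v.is_nan())
}

fn main() {
    let embedding = vec![0.12_f32, -0.03, f32::NAN, 0.87];
    // The frustrating part: the same input would sometimes trip this check and sometimes not.
    assert!(contains_nan(&embedding));
}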

Debugging this was tricky. We dropped down into ggml itself and found the ggml_metal_graph_compute function, which is responsible for passing input data through the model. It queues up the operations defined in the computation graph and executes the corresponding Metal kernels. At this level we have access to the intermediate tensors, so we logged them and found the culprit. The GeLU activation function in the feed-forward layer was non-deterministic:

// feed-forward (intermediate) sub-layer: linear -> bias -> GeLU
current = ctx0.op_mul_mat(&self.layers[il].ff_i_w, &current);
current = ctx0.op_add(&current, &self.layers[il].ff_i_b);
current = ctx0.op_gelu(&current);

We started psyching ourselves up to delve into the implementation of the GeLU kernel, but then we noticed that a fix had just been pushed to llama.cpp! All we needed to do was use the precise::tanh function in the kernel:

#include <metal_stdlib>
using namespace metal;

// constants defined at the top of ggml-metal.metal
constant float GELU_COEF_A    = 0.044715f;
constant float SQRT_2_OVER_PI = 0.79788456080286535587989211986876f;

kernel void kernel_gelu(
    device const float4 * src0,
    device       float4 * dst,
    uint tpig[[thread_position_in_grid]]) {
    device const float4 & x = src0[tpig];

    // BEWARE !!!
    // Simply using "tanh" instead of "precise::tanh" will sometimes result in NaNs!
    // This was observed with Falcon 7B and 40B models
    //
    dst[tpig] = 0.5f*x*(1.0f + precise::tanh(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}
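
For reference, the tanh-approximated GeLU the kernel computes is easy to reproduce on the CPU, which was handy for spot-checking GPU outputs (a quick sketch, not part of ggml or llm):

// CPU reference for the tanh-approximated GeLU, handy for spot-checking kernel outputs.
// The constants match the ones used in ggml's Metal source.
const SQRT_2_OVER_PI: f32 = 0.797_884_56;
const GELU_COEF_A: f32 = 0.044_715;

fn gelu(x: f32) -> f32 {
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * x * (1.0 + GELU_COEF_A * x * x)).tanh())
}

fn main() {
    for x in [-2.0_f32, -0.5, 0.0, 0.5, 2.0] {
        println!("gelu({x}) = {}", gelu(x));
    }
}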

No more NaNs. Now we were rolling. We still had to check whether the vectors our ggml model was generating were similar to the ones we were getting from ONNX (we knew they wouldn’t be identical, as the ggml model uses 4-bit rather than 8-bit quantisation). So we benchmarked the embeddings produced by both models on our internal retrieval benchmark and found that the Mean Reciprocal Rank (MRR, a common information-retrieval metric) was close:

Model    MRR
ONNX     0.47058
GGML     0.498775
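
For clarity, MRR is just the average of 1/rank of the first relevant result across the benchmark queries; a minimal sketch of the calculation:

// Mean Reciprocal Rank: the average of 1/rank of the first relevant result per query.
// `ranks` holds the 1-based rank of the first relevant chunk for each benchmark query.
fn mean_reciprocal_rank(ranks: &[usize]) -> f64 {
    ranks.iter().map(|&r| 1.0 / r as f64).sum::<f64>() / ranks.len() as f64
}

fn main() {
    // An MRR around 0.5 corresponds to the relevant chunk sitting at roughly rank 2 on average.
    let ranks = [1, 2, 2, 3, 4];
    println!("MRR = {:.3}", mean_reciprocal_rank(&ranks));
}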

Even though the ggml embeddings aren’t identical, an MRR of roughly 0.5 means that when we use them for semantic retrieval, a relevant chunk of code still appears on average around second place in the results list. Now we could hit run and…

Boom! Look at that GPU spike.

But we weren’t finished yet. The real beauty of indexing on the GPU is being able to batch inputs, parallelising the computation across multiple sequences. In our indexing pipeline we split code files into chunks roughly 256 tokens long, so batching inputs of similar length wouldn’t waste much compute on padding tokens. And all we’d need to do is add an extra batch_size dimension to all our tensors, right? Well, ggml doesn’t have Metal kernels for 4-dimensional matrix multiplications, but that’s not a problem. We can reshape the [d_head, sequence_len, n_head, batch_size] tensors to [d_head, sequence_len, n_head * batch_size], perform the matrix multiplication, then reshape them back:

// fold the head and batch dimensions together so the 3D matmul kernels can handle them
let q = ctx0.op_reshape_3d(&q, d_head, sequence_len, n_head * batch_size);
let k = ctx0.op_reshape_3d(&k, d_head, sequence_len, n_head * batch_size);
let mut kq = ctx0.op_mul_mat(&k, &q);

// scale the attention scores by 1/sqrt(d_head), where d_head = n_embd / n_head
kq = ctx0.op_scale(
    &kq,
    &ctx0.new_f32(1.0 / ((n_embd as f32 / n_head as f32).sqrt())),
);

// split the heads and batch back out into separate dimensions
let kq = ctx0.op_reshape_4d(&kq, sequence_len, sequence_len, n_head, batch_size);

After some fiddling around with padding and masking we had it: batched embedding running on the MacBook GPU.
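
The masking part boils down to giving padded positions a large negative score before the softmax, so they contribute almost nothing to attention. Here is a simplified sketch in plain Rust, independent of ggml’s tensor API (attention_mask is an illustrative helper, not a function from llm or ggml):

// Simplified view of the padding mask: 0.0 for real tokens and a large negative
// value for padding, added to the attention scores before the softmax so padded
// positions get (almost) zero weight.
fn attention_mask(token_counts: &[usize], max_len: usize) -> Vec<Vec<f32>> {
    token_counts
        .iter()
        .map(|&n| (0..max_len).map(|pos| if pos < n { 0.0 } else { -1e9 }).collect())
        .collect()
}

fn main() {
    // Two sequences padded to a common length of 6 tokens.
    let mask = attention_mask(&[6, 4], 6);
    let expected: Vec<f32> = vec![0.0, 0.0, 0.0, 0.0, -1e9, -1e9];
    assert_eq!(mask[1], expected);
}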

Time to run some benchmarks. We collected 10 sequences from a sample codebase, each 256 tokens long. For models without batching we timed how long it took to pass the 10 sequences through the model one at a time; for batched models we passed all 10 sequences at once in a single batch.
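
Roughly, the harness looked like the sketch below, where embed_one and embed_batch are hypothetical stand-ins for the per-backend model calls (they are not functions from llm or ort):

use std::time::Instant;

// Sketch of the benchmark loop; `embed_one` and `embed_batch` are hypothetical
// stand-ins for the per-backend model calls.
fn bench<E1, EB>(sequences: &[String], mut embed_one: E1, embed_batch: EB)
where
    E1: FnMut(&str) -> Vec<f32>,
    EB: FnOnce(&[String]) -> Vec<Vec<f32>>,
{
    // Serial path: feed the sequences through the model one at a time.
    let start = Instant::now();
    for seq in sequences {
        let _ = embed_one(seq.as_str());
    }
    println!("serial:  {:?}", start.elapsed());

    // Batched path: all sequences in a single forward pass.
    let start = Instant::now();
    let _ = embed_batch(sequences);
    println!("batched: {:?}", start.elapsed());
}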

Here are the results on an M2 Mac mini (8GB unified memory):

That’s roughly a 13x speedup over our original CPU model. At that rate it would take only 54 minutes to index LLVM. Not instantaneous, but you wouldn’t have to leave it running overnight. Even better, the 4-bit ggml model is also faster than the ONNX model on the CPU, so we can use it on machines that don’t have a GPU too.

So what did we learn? While the ecosystem for running open-source AI models on-device is immature, and there are still plenty of holes, it’s developing incredibly fast! Features and fixes are rushing into libraries like ggml and llm. And a lot of it is being built in Rust 🦀. We didn’t even get to talk about other Rust ML projects like burn or Hugging Face’s Candle.

Thanks to all of the open-source projects mentioned in this post. You can check out our fork of llm here, our quantised ggml-format MiniLM model here, and our benchmark repo here.
