Run large language models at home, BitTorrent‑style

  • Generate text with Llama 2 (70B), Falcon (180B), BLOOM (176B) (or their derivatives),
    and fine‑tune them for your tasks — using a consumer-grade GPU or Google Colab.
  • You load a small part of the model, then join a network
    of people serving the other parts.
    Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and
    up to 4 tokens/sec for Falcon (180B) — enough for
    chatbots and interactive apps.
  • Beyond classic LLM APIs —
    you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states.
    You get the comforts of an API with the flexibility of PyTorch and 🤗 Transformers.
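
As a rough illustration of the points above, here is a minimal client-side sketch, assuming the `petals` and `transformers` Python packages are installed and that a public swarm is currently serving the chosen model. The helper name `generate_with_petals` is hypothetical, and the default model name is only an assumed example of a Llama 2 (70B) derivative; substitute whichever model the network is actually hosting.

```python
def generate_with_petals(prompt, model_name="petals-team/StableBeluga2", max_new_tokens=5):
    """Generate a short continuation of `prompt` through a Petals swarm.

    Hypothetical helper for illustration only. The default `model_name`
    is an assumption, not something the text above specifies.
    """
    # Imported here so that defining the function does not require
    # the packages to be installed or a network connection to exist.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    # The tokenizer and a small part of the model are loaded locally;
    # the remaining transformer blocks are served by other peers.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    # From here on the model is used like a regular Transformers model.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])
```

Calling `generate_with_petals("A cat sat")` would download the tokenizer, connect to the swarm, and return the prompt followed by a few generated tokens. Speed and availability depend on how many peers are serving the model at that moment.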


Follow development in Discord or via email:

We send updates once every few months. No spam.

This project is a part of the BigScience research workshop.