Mandela’s Library of Alexandria

Quality Content

Internet-in-a-Box shows you the latest Content Packs installable in the languages your community needs, from online libraries like Kiwix, OER2Go, and Archive.org, then takes care of all the downloading details for you!

See Mexico’s live demo and our medical examples, used especially by clinics in Asia and Africa, as hosted by Wikipedia.

Schools can also choose among almost 40 powerful apps for teachers and students, optionally with a complete LMS (learning management system) like Kolibri, Moodle, Nextcloud, Sugarizer or WordPress.

Friendly Community

Internet-in-a-Box is a community product enabled by professional volunteers working side-by-side with schools, clinics and libraries around the world, and the Wikipedia community especially.

Thank you everyone for humbly being part of this OFF.NETWORK grassroots learning movement.

Please consider how you too might assist this epic effort. It’s astonishing how far we’ve come since Internet-in-a-Box’s original demo in 2013, and how far we will go together, if you too can help!

RedPajama: Reproduction of Llama with Friendly License

Foundation models such as GPT-4 have driven rapid improvement in AI. However, the most powerful models are closed commercial models or only partially open. RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.

The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components:

  1. Pre-training data, which needs to be both high quality and have broad coverage

  2. Base models, which are trained at scale on this data

  3. Instruction tuning data and models, which improve the base model to make it usable and safe

Today, we are releasing the first component, pre-training data.

“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”

Our starting point is LLaMA, which is the leading suite of open base models for two reasons: First, LLaMA was trained on a very large (1.2 trillion tokens) dataset that was carefully filtered for quality. Second, the 7 billion parameter LLaMA model is trained for much longer, well beyond the Chinchilla-optimal point, to ensure the best quality at that model size. A 7 billion parameter model is particularly valuable for the open community as it can run on a wide variety of GPUs, including many consumer-grade GPUs. However, LLaMA and all its derivatives (including Alpaca, Vicuna, and Koala) are only available for non-commercial research purposes. We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research.

The RedPajama base dataset

The full RedPajama 1.2 trillion token dataset and a smaller, more consumable random sample can be downloaded through Hugging Face.

RedPajama-Data-1T consists of seven data slices:

  • CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.

  • C4: Standard C4 dataset

  • GitHub: GitHub data, filtered by licenses and quality

  • arXiv: Scientific articles removing boilerplate

  • Books: A corpus of open books, deduplicated by content similarity

  • Wikipedia: A subset of Wikipedia pages, removing boilerplate

  • StackExchange: A subset of popular websites under StackExchange, removing boilerplate
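
The classifier-based filtering described for the CommonCrawl slice can be illustrated with a toy sketch. The features, weights, and threshold below are illustrative inventions, not the actual CCNet pipeline or the RedPajama linear classifier; they simply show the general shape of scoring documents and keeping "Wikipedia-like" prose:

```python
# Toy document-quality filter in the spirit of the CommonCrawl filtering
# described above. All features, weights, and thresholds are made up for
# illustration; the real pipeline uses CCNet plus a trained classifier.

def features(text):
    """Extract a few crude quality signals from a document."""
    words = text.split()
    if not words:
        return [0.0, 0.0, 0.0]
    alpha_frac = sum(c.isalpha() for c in text) / len(text)    # prose vs. markup noise
    mean_word_len = sum(len(w) for w in words) / len(words)    # very short tokens suggest junk
    upper_frac = sum(w.isupper() for w in words) / len(words)  # shouty text is often spam
    return [alpha_frac, mean_word_len, upper_frac]

# Hypothetical linear-classifier weights favoring encyclopedic prose.
WEIGHTS = [2.0, 0.3, -3.0]
THRESHOLD = 2.5

def keep(text):
    """Keep a document if its linear score clears the threshold."""
    score = sum(w * f for w, f in zip(WEIGHTS, features(text)))
    return score > THRESHOLD

docs = [
    "The French Revolution was a period of political and societal change in France.",
    "CLICK NOW !!! $$$ FREE $$$ WIN WIN WIN !!!",
]
print([keep(d) for d in docs])  # [True, False]
```

In the real pipeline the weights come from a classifier trained to distinguish Wikipedia reference pages from random web text, but the keep-or-drop decision has this same linear-score form.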

For each data slice, we conduct careful data pre-processing and filtering, and tune our quality filters to roughly match the number of tokens as reported by Meta AI in the LLaMA paper:

                RedPajama       LLaMA*
CommonCrawl     878 billion     852 billion
C4              175 billion     190 billion
GitHub           59 billion     100 billion
Books            26 billion      25 billion
ArXiv            28 billion      33 billion
Wikipedia        24 billion      25 billion
StackExchange    20 billion      27 billion
Total           1.2 trillion    1.25 trillion

* estimated from Table 1 in https://arxiv.org/abs/2302.13971
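
As a quick sanity check, the per-slice counts in the table can be totaled to confirm they land near the reported 1.2 trillion (RedPajama) and 1.25 trillion (LLaMA) token totals:

```python
# Per-slice token counts transcribed from the table above, in billions.
redpajama = {
    "CommonCrawl": 878, "C4": 175, "GitHub": 59, "Books": 26,
    "ArXiv": 28, "Wikipedia": 24, "StackExchange": 20,
}
llama_est = {
    "CommonCrawl": 852, "C4": 190, "GitHub": 100, "Books": 25,
    "ArXiv": 33, "Wikipedia": 25, "StackExchange": 27,
}

print(sum(redpajama.values()))  # 1210 billion, i.e. ~1.2 trillion
print(sum(llama_est.values()))  # 1252 billion, i.e. ~1.25 trillion
```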

We are making all data pre-processing and quality filters openly available on GitHub. Anyone can follow the data preparation recipe and reproduce RedPajama-Data-1T.

Interactively analyzing the RedPajama base dataset

In collaboration with the Meerkat project, we are releasing a Meerkat dashboard and embeddings for exploring the GitHub subset of the corpus.

[Figure: Meerkat dashboard interactively exploring the RedPajama base dataset and viewing matching records.]

You can find instructions on how to install and use the dashboard on GitHub.

Up next: Models, instructions & OpenChatKit

Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks.

With a strong base model in hand, we are excited to instruction tune the models. Alpaca illustrated the power of instruction tuning – with merely 50K high-quality, diverse instructions, it was able to unlock dramatically improved capabilities. Via OpenChatKit, we received hundreds of thousands of high-quality natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.

Acknowledgements

We are grateful for the work done by the growing open-source AI community that made this project possible.

April 17, 2023
