Since its release in November, ChatGPT has captured the imagination of the world. People are using it for all kinds of tasks and applications. It has the potential to change popular applications and create new ones.
But ChatGPT has also triggered an AI arms race between tech giants such as Microsoft and Google. This has pushed the industry toward more competition and less openness on large language models (LLMs). The source code, model architecture, weights, and training data of these instruction-following LLMs are not available to the public. Most of them are available only through commercial APIs or black-box web applications.
Closed LLMs such as ChatGPT, Bard, and Claude have many advantages, including ease of access to sophisticated technology. But they also pose limits to research labs and scientists who want to study and better understand LLMs. They are also inconvenient for companies and organizations that want to create and run their own models.
Fortunately, in tandem with the race to create commercial LLMs, there is also a community effort to create open-source models that match the performance of state-of-the-art LLMs. These models can help improve research by sharing results. They can also help prevent a few wealthy organizations from having too much sway and power over the LLM market.
One of the most important open-source language models comes from FAIR, Meta’s AI research lab. In February, FAIR released LLaMA, a family of LLMs that come in four different sizes: 7, 13, 33, and 65 billion parameters. (ChatGPT is based on the 175-billion-parameter InstructGPT model.)
FAIR researchers trained LLaMA 65B and LLaMA 33B on 1.4 trillion tokens, and the smallest model, LLaMA 7B, on one trillion tokens. (GPT-3 175B, which is the base model for InstructGPT, was trained on 300 billion tokens drawn from a dataset of roughly 499 billion tokens.)
LLaMA is not an instruction-following LLM like ChatGPT. But the idea behind the smaller size of LLaMA is that smaller models pre-trained on more tokens are easier to retrain and fine-tune for specific tasks and use cases. This has made it possible for other researchers to fine-tune the model for ChatGPT-like performance through techniques such as reinforcement learning from human feedback (RLHF).
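To put the "smaller models, more tokens" idea in numbers, here is a small sketch that computes training tokens per parameter for the LLaMA sizes whose token counts are given above. The function and variable names are only illustrative, and the figures are the rounded ones quoted in this article.

```python
# Parameter and training-token counts as quoted in the article (rounded).
LLAMA_RUNS = {
    "LLaMA 7B": (7e9, 1.0e12),
    "LLaMA 33B": (33e9, 1.4e12),
    "LLaMA 65B": (65e9, 1.4e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return tokens / params


ratios = {
    name: tokens_per_parameter(params, tokens)
    for name, (params, tokens) in LLAMA_RUNS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} training tokens per parameter")
```

The smallest model sees by far the most data per parameter, which is the trade-off the paragraph above describes.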
Meta released the model under “a noncommercial license focused on research use cases.” It will only make it accessible to academic researchers, government-affiliated organizations, civil society, and research labs on a case-by-case basis. Meta has published a paper and a model card for LLaMA, along with a form to request access to the trained models.
(The model was leaked online shortly after its release, which effectively made it available to everyone.)
In March, researchers at Stanford released Alpaca, an instruction-following LLM based on LLaMA 7B. They fine-tuned the LLaMA model on a dataset of 52,000 instruction-following examples generated from InstructGPT.
They used a technique called self-instruct, in which an LLM generates instruction, input, and output samples to fine-tune itself. Self-instruct starts with a small seed of human-written examples that include instruction and output. The researchers use the examples to prompt the language model to generate similar examples. They then review and filter the generated examples, adding the high-quality outputs to the seed pool and removing the rest. They repeat the process until they obtain a large-enough dataset to fine-tune the target model.
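The loop described above can be sketched roughly as follows. This is a minimal illustration, not the Stanford team's code: `self_instruct`, `toy_generate`, and `keep_example` are hypothetical names, and `toy_generate` is a stand-in for the real call to a language model so that the example runs on its own.

```python
import itertools
import random
from typing import Callable

Example = dict  # expected keys: "instruction" and "output"


def self_instruct(
    seed: list[Example],
    generate: Callable[[list[Example]], list[Example]],
    keep: Callable[[Example], bool],
    target_size: int,
    demos_per_round: int = 3,
    max_rounds: int = 100,
) -> list[Example]:
    """Grow a pool of examples from a small human-written seed."""
    rng = random.Random(0)  # fixed seed so runs are repeatable
    pool = list(seed)
    seen = {ex["instruction"] for ex in pool}
    for _ in range(max_rounds):
        if len(pool) >= target_size:
            break
        # Prompt the model with a few examples drawn from the current pool.
        demos = rng.sample(pool, min(demos_per_round, len(pool)))
        for candidate in generate(demos):
            # Filter step: drop duplicates and low-quality samples.
            if candidate["instruction"] in seen or not keep(candidate):
                continue
            pool.append(candidate)
            seen.add(candidate["instruction"])
    return pool[:target_size]


_ids = itertools.count(1)


def toy_generate(demos: list[Example]) -> list[Example]:
    """Stand-in for an LLM call: one new sample per demonstration."""
    samples = []
    for demo in demos:
        n = next(_ids)
        # Every third sample gets an empty output to exercise the filter.
        output = "" if n % 3 == 0 else f"answer {n}"
        samples.append({"instruction": f"{demo['instruction']} #{n}", "output": output})
    return samples


def keep_example(example: Example) -> bool:
    """Toy quality check: reject samples with an empty output."""
    return bool(example["output"].strip())


seed_examples = [
    {"instruction": "Translate 'hello' to French", "output": "bonjour"},
    {"instruction": "Name a prime number", "output": "7"},
]
dataset = self_instruct(seed_examples, toy_generate, keep_example, target_size=8)
print(len(dataset), "examples collected")
```

In a real pipeline, `generate` would send the sampled examples to the model as a prompt and parse its response, and `keep` would apply much stricter checks for quality and similarity to existing examples.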
According to their preliminary experiments, Alpaca’s performance is very similar to that of InstructGPT.
The Stanford researchers released the entire self-instruct dataset and the details of the data generation process, along with the code for generating the data and fine-tuning the model. (Since Alpaca is based on LLaMA, you must obtain the original model from Meta.)
According to the researchers, sample generation and fine-tuning together cost less than $600, which is very convenient for cash-strapped labs and organizations.
Researchers at UC Berkeley, Carnegie Mellon University, Stanford, and UC San Diego released Vicuna, another instruction-following LLM based on LLaMA. Vicuna comes in two sizes, 7 billion and 13 billion parameters.
The researchers fine-tuned Vicuna using the training code from Alpaca and 70,000 examples from ShareGPT, a website where users can share their conversations with ChatGPT. They made some enhancements to the training process to support longer conversation contexts. They also used the SkyPilot machine learning workload manager to reduce the costs of training from $500 to around $140.
Preliminary evaluations, which used GPT-4 as a judge, show that Vicuna outperforms LLaMA and Alpaca, and that it is also very close to Bard and ChatGPT. The researchers released the model weights along with a full framework to install, train, and run LLMs. There is also a very interesting online demo where you can test and compare Vicuna with other open-source instruction LLMs.
Vicuna’s online demo is “a research preview intended for non-commercial use only.” To run your own model, you must first obtain the LLaMA instance from Meta and apply the weight deltas to it.
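Conceptually, applying weight deltas means adding the released differences to the original weights. The sketch below assumes that each delta is the element-wise difference between the fine-tuned and base weights. It uses plain Python lists and made-up tensor names to stay self-contained, so it is an illustration of the idea rather than the project's actual tooling.

```python
def apply_delta(
    base: dict[str, list[float]],
    delta: dict[str, list[float]],
) -> dict[str, list[float]]:
    """Reconstruct fine-tuned weights as base + delta, tensor by tensor."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for tensor {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Hypothetical tensors standing in for real model weights.
base_weights = {"layer.weight": [0.5, -1.0], "layer.bias": [0.0]}
delta_weights = {"layer.weight": [0.25, 0.5], "layer.bias": [-0.125]}

merged_weights = apply_delta(base_weights, delta_weights)
print(merged_weights)
```

Distributing only the deltas lets a project share its fine-tuning work without redistributing the LLaMA weights themselves.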
In March, Databricks released Dolly, a fine-tuned version of EleutherAI’s GPT-J 6B. The researchers were inspired by the work done by the teams behind LLaMA and Alpaca. Training Dolly cost less than $30 and took 30 minutes on a single machine.
The use of the EleutherAI base model removed the limitations Meta imposed on LLaMA-derived LLMs. However, Databricks trained Dolly on the same data that the Stanford Alpaca team had generated through ChatGPT. Therefore, the model still couldn’t be used for commercial purposes due to the non-compete limits OpenAI imposes on data generated by ChatGPT.
In April, the same team released Dolly 2.0, a 12-billion-parameter model based on EleutherAI’s Pythia model. This time, Databricks fine-tuned the model on a dataset of 15,000 instruction-following examples generated fully by humans. They gathered the examples in an interesting, gamified process involving 5,000 of Databricks’ own staff.
Databricks released the trained Dolly 2.0 model, which has none of the limitations of the previous models and can be used for commercial purposes. They also released the 15,000-example instruction-following corpus that they used to fine-tune the Pythia model. Machine learning engineers can use this corpus to fine-tune their own LLMs.
In all fairness, Open Assistant is such an interesting project that I think it deserves its own independent article. It is a ChatGPT-like language model created from the outset with the vision to prevent big corporations from monopolizing the LLM market.
The team will open-source all their models, datasets, development, data gathering, everything. It is a full, transparent, community effort. All the people involved in the project are volunteers, dedicated to open science. It is a different vision from what is happening behind the walled gardens of big tech companies.
The best way to learn about Open Assistant is to watch the entertaining videos of its co-founder and team lead Yannic Kilcher, who has long been an outspoken critic of the closed approach of organizations such as OpenAI.
Open Assistant has different versions based on LLaMA and Pythia. You can use the Pythia version for commercial purposes. Most of the models can run on a single GPU.
More than 13,000 volunteers from across the globe helped collect the examples used to fine-tune the base models. The team will soon release all the data along with a paper that explains the entire project. The trained models are available on Hugging Face. The project’s GitHub page contains the full code for training the model and the frontend to use the model.
The beauty of open source
The recent push to build open-source LLMs has done a lot to revive the spirit of collaboration and shared power that was the original promise of the internet. It shows how all these different communities can help each other and help advance the field.
LLaMA’s open-source models helped spur the movement. The Alpaca project showed that creating instruction-tuned LLMs did not require huge efforts and costs. This in turn inspired the Vicuna project, which further reduced the costs of training and gathering data. Dolly took the efforts in a different direction, showing the benefits of community-led data-gathering efforts to work around the non-compete requirements of commercial models.
There are several other models that are worth mentioning, including UC Berkeley’s Koala and llama.cpp, a C++ implementation of the LLaMA models that can run on ARM processors. It will be interesting to see how the open-source movement develops in the coming months and how it will affect the LLM market.