OSS-Instruct gives open-source LLMs a huge boost in coding

Image: a magic hat with computer code (generated with Bing Image Creator)

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) have made impressive advances in programming assistance, offering code-generation capabilities that streamline development. Currently, the models most adept at coding, such as OpenAI’s ChatGPT and GPT-4, are closed source, contain hundreds of billions of parameters, and are trained extensively on diverse, often exclusive datasets.

However, a new paper from the University of Illinois at Urbana-Champaign and Tsinghua University marks a significant advance for open-source LLMs. The paper introduces OSS-Instruct, a technique that leverages powerful models like GPT-4 to generate diverse fine-tuning examples specifically for coding tasks. These examples can then be used to turn open-source models into capable, efficient coding assistants.

Using this method, the researchers created Magicoder, a suite of 7-billion-parameter LLMs refined with OSS-Instruct. These models have surpassed most other open-source contenders, even those with substantially more parameters. Remarkably, in certain benchmarks, Magicoder approaches the performance of GPT-3.5, demonstrating OSS-Instruct’s potential to democratize high-quality coding assistance through open-source models.

OSS-Instruct

Creating language models capable of generating code typically involves fine-tuning a foundational model, such as Llama, with a dataset comprising coding instructions and corresponding examples. The crux of fine-tuning for code generation lies in curating this dataset with precise instruction-code pairs. 

A popular technique for generating training examples is Self-Instruct, which gives a handful of problem-solution pairs as seeds to a strong LLM such as GPT-4 and prompts it to generate similar examples. These pairs are then used to fine-tune smaller models. Another approach, Code Evol-Instruct, employs heuristics to expand upon these seeds, creating a broader array of examples.
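
To make the contrast with OSS-Instruct concrete, here is a minimal sketch of the Self-Instruct idea: a few seed problem/solution pairs are packed into a few-shot prompt that asks a strong teacher model to invent one more pair in the same format. The prompt wording and seed tasks below are illustrative, not the exact ones used by Self-Instruct or Code Alpaca.

```python
# Illustrative sketch of Self-Instruct-style prompting (not the exact prompts
# used by Self-Instruct or Code Alpaca). The seed tasks below are made up.
seed_pairs = [
    ("Write a Python function that reverses a string.",
     "def reverse_string(s):\n    return s[::-1]"),
    ("Write a Python function that checks whether a number is prime.",
     "def is_prime(n):\n"
     "    if n < 2:\n        return False\n"
     "    return all(n % d for d in range(2, int(n ** 0.5) + 1))"),
]

def build_self_instruct_prompt(seeds):
    """Turn seed instruction/solution pairs into a few-shot prompt that asks
    the teacher LLM to invent one more pair in the same format."""
    blocks = [f"Instruction: {inst}\nSolution:\n{sol}" for inst, sol in seeds]
    blocks.append("Now write one new, different coding instruction and its "
                  "solution in the same format.")
    return "\n\n".join(blocks)

print(build_self_instruct_prompt(seed_pairs))
```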

Despite their utility, these methods have intrinsic limitations. They depend on a finite set of manually predefined tasks or heuristics, which can restrict the diversity of generated code.

For instance, Code Alpaca’s instruction fine-tuning dataset was derived from a mere 21 seed examples, highlighting the narrow scope of such techniques.

OSS-Instruct, the innovative technique presented in the new paper, provides an alternative to using constrained seed examples. This method begins by selecting random code snippets from open-source code repositories like GitHub. These snippets are then fed into a powerful LLM like ChatGPT, accompanied by a prompt that instructs the model to craft a new coding problem and its corresponding solution inspired by the snippet.
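
In code, a single OSS-Instruct-style generation step might look roughly like the sketch below, which sends one seed snippet to GPT-3.5 Turbo through the OpenAI API. The prompt wording here paraphrases the paper's idea; the exact template is published with the paper.

```python
# Minimal sketch of an OSS-Instruct-style generation step. The prompt wording
# below paraphrases the idea from the paper; it is not the exact template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def oss_instruct(seed_snippet: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask a teacher LLM to invent a coding problem and solution that are
    inspired by (not copied from) the given open-source code snippet."""
    prompt = (
        "Below is a code fragment taken from an open-source repository.\n"
        "Gain inspiration from it to create a self-contained coding problem, "
        "then write a correct solution.\n\n"
        f"[Code fragment]\n{seed_snippet}\n\n"
        "[Problem]\n...\n\n[Solution]\n..."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Even a bare function signature can serve as a seed:
print(oss_instruct("def merge_intervals(intervals: list[tuple[int, int]]):"))
```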

OSS-Instruct framework (source: arXiv)

A striking feature of OSS-Instruct is its flexibility regarding the seed material. It doesn’t require complete code passages. Even a function signature, a shell script, or a set of library imports is sufficient for the LLM to generate a robust problem/solution pair. This flexibility is key to OSS-Instruct’s ability to generate a highly diverse dataset, in stark contrast to other methods tethered to a limited seed pool.

“OSS-INSTRUCT can directly produce diverse, realistic, and controllable code instructions by providing distinct seed code snippets,” the researchers write. “It opens a new dimension for creating low-bias and high-quality instruction-tuning data from the abundance of open-source references.”

The full OSS-Instruct prompt template

Magicoder

To test the potential of OSS-Instruct, the researchers used a subset of The Stack, a curated dataset of roughly 3 TB of permissively licensed source code. The Stack was used in training StarCoder, a popular open-source language model for code generation.

They extracted 1–15 lines of code at random from each document in their dataset to create a collection of 80,000 unique seeds. These snippets were then passed to GPT-3.5 Turbo along with the OSS-Instruct prompt to generate corresponding problem/solution pairs.
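
A simplified version of that seed-extraction step could look like the sketch below. The function name and the exact sampling details are assumptions; the authors' released code is the authoritative reference.

```python
import random

def extract_seed(document: str, min_lines: int = 1, max_lines: int = 15) -> str:
    """Sample a short run of consecutive lines from one source file to use as
    an OSS-Instruct seed snippet. Sampling details here are assumptions."""
    lines = [l for l in document.splitlines() if l.strip()]  # drop blank lines
    if not lines:
        return ""
    length = random.randint(min_lines, min(max_lines, len(lines)))
    start = random.randrange(0, len(lines) - length + 1)
    return "\n".join(lines[start:start + length])

# Example: one seed per document; 80,000 documents would yield 80,000 seeds.
corpus = [
    "import os\n\ndef list_py_files(root):\n"
    "    return [f for f in os.listdir(root) if f.endswith('.py')]\n"
]
seeds = [extract_seed(doc) for doc in corpus]
print(seeds[0])
```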

To ensure the quality and uniqueness of their dataset, the team implemented strategies to eliminate repetitive examples and safeguard against data contamination, a scenario where test data inadvertently seeps into the training material and leads to misleading evaluation results.
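
The sketch below shows the general shape of such filtering: a crude deduplication pass plus an n-gram overlap check against benchmark problems. It is a generic illustration, not the exact procedure used for Magicoder.

```python
# Generic sketch of dataset filtering (not the paper's exact procedure):
# drop near-duplicate examples and drop anything whose text shares a long
# token n-gram with a benchmark problem such as HumanEval or MBPP.
def ngrams(text: str, n: int = 10):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def filter_examples(examples, benchmark_prompts, n: int = 10):
    bench_grams = set().union(*(ngrams(p, n) for p in benchmark_prompts))
    seen, clean = set(), []
    for ex in examples:
        key = " ".join(ex.split())          # crude normalization for dedup
        if key in seen or (ngrams(ex, n) & bench_grams):
            continue                        # duplicate or contaminated
        seen.add(key)
        clean.append(ex)
    return clean

benchmark = ["Check if in given list of numbers, any two numbers are closer "
             "to each other than the given threshold."]  # HumanEval-style text
examples = ["Write a function that returns the sum of the even numbers in a list."]
print(filter_examples(examples, benchmark))
```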

The final result was 75,000 high-quality coding examples, each derived from a distinct seed snippet used only once, giving the collection far greater diversity than datasets generated from a small batch of seeds.

The researchers then fine-tuned Code Llama-7B with this rich dataset to produce Magicoder-CL, a highly efficient coding language model. 
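
A rough sketch of that fine-tuning step with Hugging Face Transformers is shown below. The prompt format, hyperparameters, and tiny in-memory dataset are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch of instruction fine-tuning a Code Llama-7B base on generated
# problem/solution pairs. Prompt format and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# In practice this would be the 75K OSS-Instruct pairs; one toy pair here.
pairs = [{"problem": "Reverse a string.", "solution": "def rev(s): return s[::-1]"}]

def to_features(example):
    text = f"### Problem\n{example['problem']}\n\n### Solution\n{example['solution']}"
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(pairs).map(to_features, remove_columns=["problem", "solution"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="magicoder-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=2,
                           learning_rate=5e-5,
                           bf16=True),
    train_dataset=dataset,
    # mlm=False makes the collator copy input_ids into labels for causal LM.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```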

The researchers note, “OSS-Instruct is orthogonal to existing data generation methods, and they can be combined to further push the boundaries of the models’ coding capabilities.” 

Building on this principle, they further refined Magicoder-CL using evol-codealpaca-v1, an Evol-Instruct dataset comprising 110,000 examples. The enhanced model was named MagicoderS-CL.

Magicoder outperforms many coding LLMs that are substantially larger

The team evaluated both models on the HumanEval and MBPP benchmarks, along with the augmented HumanEval+ and MBPP+ variants, which add many more test cases per problem. The findings were compelling: Magicoder-CL surpassed all open-source coding models with up to 16 billion parameters. Even more strikingly, MagicoderS-CL, enriched with the additional Evol-Instruct examples, narrowly trailed the 34-billion-parameter WizardCoder-CL on HumanEval and exceeded both WizardCoder-CL-34B and GPT-3.5 on HumanEval+, hinting at superior robustness in the code it generates.
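
Benchmarks like HumanEval and MBPP score models on functional correctness: a completion counts only if the task's unit tests pass. The toy sketch below illustrates that scoring idea; real harnesses such as EvalPlus sandbox the execution, and running untrusted model output with exec() is unsafe outside one.

```python
# Toy illustration of functional-correctness scoring: a completion passes only
# if the task's unit tests all succeed. Real harnesses sandbox this step.
def passes(task_prompt: str, completion: str, test_code: str) -> bool:
    program = task_prompt + completion + "\n" + test_code
    env = {}
    try:
        exec(program, env)   # do NOT do this outside a sandbox
        return True
    except Exception:
        return False

task = "def add(a, b):\n"
completion = "    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes(task, completion, tests))  # pass@1 with a single sample
```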

Furthermore, the researchers developed two additional Magicoder variants based on the DeepSeek-Coder models. These variants achieved remarkable results on both the HumanEval and MBPP benchmarks while using far less fine-tuning data than the official instruct versions of those models.

Making better coding LLMs

Magicoder stands out as a highly efficient model, delivering near state-of-the-art performance while requiring only a fraction of the memory and computational power typically needed. Its lean design allows it to operate on consumer-grade GPUs, significantly reducing costs and making advanced code generation accessible to a broader audience. 
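
For instance, a 7-billion-parameter Magicoder variant can be loaded on a single consumer GPU with 4-bit quantization, along the lines of the sketch below. The Hugging Face repo id is an assumption here; check the authors' release for the exact model names.

```python
# Sketch of running a Magicoder model on a consumer GPU with 4-bit
# quantization via bitsandbytes. The repo id below is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "ise-uiuc/Magicoder-S-CL-7B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "Write a Python function that parses an ISO-8601 date string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```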

The robustness and scalability of OSS-Instruct are evident in the performance of Magicoder. The technique shows promise for further gains, suggesting that expanding the dataset and applying the method to larger models could yield even more impressive results. 

The researchers have open-sourced the model weights, training data, and source code. Developers and organizations can deploy the models immediately or use the OSS-Instruct dataset and codebase to train their own custom models on proprietary coding repositories. (It is worth noting, however, that since the training data was generated with GPT-3.5, its use is subject to OpenAI’s terms of use, which prohibit using model outputs to build competing commercial products.)

Looking ahead, the team behind OSS-Instruct has plans to test the method at larger scales. “In the near future, we will apply OSS-INSTRUCT to larger base models. We will also continue advancing OSS-INSTRUCT by generating higher-quality data with a strategically designed distribution of the seed code snippets and with more advanced teacher LLMs such as GPT-4,” they write.
