How to train coding LLMs with small auto-generated datasets

Image: robot writing code (generated with Bing Image Creator)

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) such as GPT-4 are remarkably proficient at writing software code. But the costs and opacity of these models have spurred interest in more economical, smaller coding language models.

These alternatives can be fine-tuned for specific tasks and operated at a fraction of the cost. One big challenge in developing these LLMs is finding the optimal balance between the size of the training dataset and the performance of the model.

Addressing this challenge, a recent Microsoft research paper proposes a novel technique for training efficient coding language models with fewer examples. The paper introduces WaveCoder, a model that the researchers claim outperforms other models trained on similar-sized datasets.

Complementing WaveCoder, Microsoft has developed CodeOcean, a curated dataset comprising 20,000 diverse code examples. This dataset can enhance the fine-tuning of foundational models for coding applications.

Choosing the right coding samples

CodeOcean pipeline (source: arXiv)

While WaveCoder is an impressive model, the more interesting part of the paper is CodeOcean, the accompanying dataset. CodeOcean addresses a significant challenge: creating a dataset that balances cost-effectiveness with quality. The researchers hypothesize that a dataset with maximum diversity can yield impressive results, even with a limited number of examples.

The team began with CodeSearchNet, an extensive coding dataset comprising 2 million pairs of comments and code. They employed a BERT-based transformer model to generate embeddings for each example, translating it into a numerical vector that captures its content.
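
To make this concrete, here is a minimal sketch of how such embeddings could be computed. The specific encoder (microsoft/codebert-base) and the mean-pooling step are assumptions for illustration; the paper only states that a BERT-based transformer model was used to embed each example.

```python
# Illustrative sketch: embedding raw code examples with a BERT-style encoder.
# The model choice and mean pooling are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code_snippets, batch_size=32):
    """Return one embedding vector per snippet (mean-pooled last hidden state)."""
    vectors = []
    for i in range(0, len(code_snippets), batch_size):
        batch = code_snippets[i : i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state      # (B, T, H)
        mask = inputs["attention_mask"].unsqueeze(-1)        # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)        # mean over real tokens
        vectors.append(pooled)
    return torch.cat(vectors).numpy()
```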

They applied a clustering algorithm to the embeddings to sort the examples based on their similarity. This method allowed the researchers to extract a subset from the original dataset that maximizes diversity. 
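
As an illustration of this selection step, the sketch below clusters the embeddings with k-means and keeps the example nearest to each cluster center. The choice of k-means and the nearest-to-centroid rule are assumptions; the paper only says a clustering algorithm was applied to the embeddings to maximize diversity.

```python
# Illustrative diversity-based selection: one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(embeddings, n_select=20_000, seed=0):
    kmeans = KMeans(n_clusters=n_select, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    selected = []
    for c in range(n_select):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # keep the member closest to the cluster centroid
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)

# Example usage (building on the embed() sketch above):
# indices = select_diverse_subset(embed(raw_code_examples), n_select=20_000)
```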

Adding instructions

After establishing the core dataset, the researchers had to create training examples that included code and instructions. To achieve this, they created a Generator-Discriminator framework for producing instructional data based on the raw code examples. Initially, they used GPT-4 to craft a task definition within a specific scenario context. These initial task definitions, combined with an instructional prompt, were given to GPT-3.5 to generate corresponding instructions for additional examples.
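
The generator step might look roughly like the following sketch, which sends a raw code example to an LLM and asks for a matching instruction. The prompts, model names, and function names here are illustrative placeholders, not the paper's actual prompts or framework code.

```python
# Sketch of the generator step: given raw code, ask an LLM for an instruction.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

TASK_DEFINITION = (
    "You write fine-tuning data for code LLMs. Given a piece of code, produce a "
    "natural-language instruction that the code correctly fulfils."
)

def generate_instruction(raw_code: str, task: str = "code summarization") -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": TASK_DEFINITION},
            {"role": "user", "content": f"Task type: {task}\n\nCode:\n{raw_code}\n\n"
                                        "Write the instruction only."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```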

CodeOcean’s generator-discriminator framework (source: arXiv)

For the discriminator component, the researchers formulated a separate evaluation prompt. This prompt, along with the code and instruction examples, was given to GPT-4 for evaluation. The pipeline then fed the examples judged to be good back into the process to guide the generation of subsequent training examples.
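
Continuing the generator sketch above, the discriminator step could be approximated as a yes/no consistency check that keeps only the pairs that pass. Again, the evaluation prompt and filtering logic are assumptions for illustration, not the paper's actual prompts.

```python
# Sketch of the discriminator step: judge (instruction, code) pairs and filter.
from openai import OpenAI

client = OpenAI()

def is_good_example(instruction: str, code: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You check training data for code LLMs."},
            {"role": "user", "content": (
                "Does the code correctly and completely fulfil the instruction?\n\n"
                f"Instruction:\n{instruction}\n\nCode:\n{code}\n\n"
                "Answer with exactly 'yes' or 'no'."
            )},
        ],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

# Example usage: pairs judged good are kept for the dataset and can guide the
# generation of the next batch.
# good_pairs = [(ins, code) for ins, code in candidate_pairs if is_good_example(ins, code)]
```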

The team generated 20,000 high-quality instructional samples through this iterative process. These examples spanned four distinct coding task categories: code generation, code summarization, language translation (from one programming language to another), and code repair. These four categories encompass a large portion of LLM coding tasks.

Training WaveCoder

WaveCoder outperforms other coding LLMs trained on a similar number of examples (source: arXiv)

There are various methods for generating training examples for coding LLMs. But Microsoft’s CodeOcean distinguishes itself with its emphasis on generalization and example efficiency. Unlike studies that rely on vast amounts of data, CodeOcean achieves high performance with a smaller dataset.

To demonstrate the effectiveness of CodeOcean, researchers fine-tuned three coding language models: StarCoder-15B, CodeLLaMA (7B and 13B), and DeepseekCoder-6.7B. Given the size of the dataset, fine-tuning was both fast and cost-efficient. The researchers evaluated the fine-tuned models against three key coding benchmarks: HumanEval, MBPP, and HumanEvalPack.
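
For a feel of what this step involves, below is a minimal sketch of instruction fine-tuning on a CodeOcean-style dataset with Hugging Face transformers. It is not the authors' training code; the prompt template, hyperparameters, and the tiny stand-in dataset are assumptions for illustration.

```python
# Minimal sketch: instruction fine-tuning a code LLM on CodeOcean-style examples.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "deepseek-ai/deepseek-coder-6.7b-base"   # one of the bases used in the paper
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Stand-in for the 20,000 CodeOcean instruction samples.
codeocean_examples = [
    {"instruction": "Write a Python function that returns the factorial of n.",
     "output": "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"},
]

def format_example(ex):
    # Simple instruction/response template (assumed, not the paper's exact format).
    text = (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=2048)

train_ds = Dataset.from_list(codeocean_examples).map(
    format_example, remove_columns=["instruction", "output"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wavecoder-sft", num_train_epochs=3,
                           per_device_train_batch_size=2, gradient_accumulation_steps=16,
                           learning_rate=2e-5, bf16=True, logging_steps=10),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```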

With a few epochs of training on CodeOcean, all models showed significant improvements on these benchmarks. On code generation, the researchers describe the impact and limitations of WaveCoder: “Following the fine-tuning process, the performance of our models exhibit substantial improvement when compared to both the foundation model and a selection of open-source models, but it still lags behind proprietary models [like GPT-4 and Gemini] [and] the instructed models training with more than 70K training data.”

The performance gap between WaveCoder and WizardCoder, which was trained on 78,000 examples, is small. This suggests that “refined and diverse instruction data can significantly improve the efficiency of instruction tuning.”

WaveCoder was particularly superior in code summarization and repair tasks. It outperformed other open-source models across nearly all programming languages. This success emphasizes “the effectiveness of defining and classifying code-related tasks on enhancing the generalization ability of Code LLMs.”

Microsoft has not yet released the model, code, and data for WaveCoder and CodeOcean. But discussions on Hugging Face indicate that Microsoft is reviewing them for possible release. Looking ahead, the researchers aim to explore the effect of larger datasets, as well as the potential benefits of combining CodeOcean with other coding datasets.
