How to train coding LLMs with small auto-generated datasets

Image: robot writing code (generated with Bing Image Creator)

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) such as GPT-4 are remarkably proficient at writing software code. But the costs and opacity of these models have spurred interest in more economical, smaller coding language models.

These alternatives can be fine-tuned for specific tasks and operated at a fraction of the cost. One big challenge in developing these LLMs is finding the optimal balance between the size of the training dataset and the performance of the model.

Addressing this challenge, a recent Microsoft research paper proposes a novel technique for training efficient coding language models with fewer examples. The paper introduces WaveCoder, a model that the researchers claim outperforms other models trained on similar-sized datasets.

Complementing WaveCoder, Microsoft has developed CodeOcean, a curated dataset comprising 20,000 diverse code examples. This dataset can enhance the fine-tuning of foundational models for coding applications.

Choosing the right coding samples

CodeOcean pipeline (source: arXiv)

While WaveCoder is an impressive model, the more interesting part of the paper is CodeOcean, the accompanying dataset. CodeOcean addresses a significant challenge: creating a dataset that balances cost-effectiveness with quality. The researchers hypothesize that a dataset with maximum diversity can yield impressive results, even with a limited number of examples.

The team began with CodeSearchNet, an extensive coding dataset comprising 2 million pairs of comments and code. They employed a BERT-based transformer model to generate embeddings for each example, translating it into a numerical vector that captures its content.
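
To make this concrete, here is a minimal sketch of how such embeddings could be computed. The specific encoder (microsoft/codebert-base) and the mean-pooling step are assumptions for illustration; the paper only states that a BERT-based transformer model was used to embed each example.

```python
# Illustrative sketch: embedding raw code examples with a BERT-style encoder.
# The model choice and mean pooling are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code_snippets, batch_size=32):
    """Return one embedding vector per snippet (mean-pooled last hidden state)."""
    vectors = []
    for i in range(0, len(code_snippets), batch_size):
        batch = code_snippets[i : i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state      # (B, T, H)
        mask = inputs["attention_mask"].unsqueeze(-1)        # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)        # mean over real tokens
        vectors.append(pooled)
    return torch.cat(vectors).numpy()
```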

They applied a clustering algorithm to the embeddings to sort the examples based on their similarity. This method allowed the researchers to extract a subset from the original dataset that maximizes diversity. 
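
As an illustration of this selection step, the sketch below clusters the embeddings with k-means and keeps the example nearest to each cluster center. The choice of k-means and the nearest-to-centroid rule are assumptions; the paper only says a clustering algorithm was applied to the embeddings to maximize diversity.

```python
# Illustrative diversity-based selection: one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(embeddings, n_select=20_000, seed=0):
    kmeans = KMeans(n_clusters=n_select, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    selected = []
    for c in range(n_select):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # keep the member closest to the cluster centroid
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)

# Example usage (building on the embed() sketch above):
# indices = select_diverse_subset(embed(raw_code_examples), n_select=20_000)
```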

Adding instructions

After establishing the core dataset, the researchers had to create training examples that included code and instructions. To achieve this, they created a Generator-Discriminator framework for producing instructional data based on the raw code examples. Initially, they used GPT-4 to craft a task definition within a specific scenario context. These initial task definitions, combined with an instructional prompt, were given to GPT-3.5 to generate corresponding instructions for additional examples.
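
The generator step might look roughly like the following sketch, which sends a raw code example to an LLM and asks for a matching instruction. The prompts, model names, and function names here are illustrative placeholders, not the paper's actual prompts or framework code.

```python
# Sketch of the generator step: given raw code, ask an LLM for an instruction.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

TASK_DEFINITION = (
    "You write fine-tuning data for code LLMs. Given a piece of code, produce a "
    "natural-language instruction that the code correctly fulfils."
)

def generate_instruction(raw_code: str, task: str = "code summarization") -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": TASK_DEFINITION},
            {"role": "user", "content": f"Task type: {task}\n\nCode:\n{raw_code}\n\n"
                                        "Write the instruction only."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```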

CodeOcean’s generator-discriminator framework (source: arXiv)

For the discriminator component, the researchers formulated a separate evaluation prompt. This prompt, along with the code and instruction examples, was given to GPT-4 for evaluation. The pipeline then fed the examples judged to be good back into the process to guide the generation of subsequent training examples.
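
Continuing the generator sketch above, the discriminator step could be approximated as a yes/no consistency check that keeps only the pairs that pass. Again, the evaluation prompt and filtering logic are assumptions for illustration, not the paper's actual prompts.

```python
# Sketch of the discriminator step: judge (instruction, code) pairs and filter.
from openai import OpenAI

client = OpenAI()

def is_good_example(instruction: str, code: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You check training data for code LLMs."},
            {"role": "user", "content": (
                "Does the code correctly and completely fulfil the instruction?\n\n"
                f"Instruction:\n{instruction}\n\nCode:\n{code}\n\n"
                "Answer with exactly 'yes' or 'no'."
            )},
        ],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

# Example usage: pairs judged good are kept for the dataset and can guide the
# generation of the next batch.
# good_pairs = [(ins, code) for ins, code in candidate_pairs if is_good_example(ins, code)]
```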

The team generated 20,000 high-quality instructional samples through this iterative process. These examples spanned four distinct coding task categories: code generation, code summarization, language translation (from one programming language to another), and code repair. These four categories encompass a large portion of LLM coding tasks.

Training WaveCoder

WaveCoder outperforms other coding LLMs trained on a similar number of examples (source: arXiv)

There are various methods for generating training examples for coding LLMs. But Microsoft’s CodeOcean distinguishes itself with its emphasis on generalization and example efficiency. Unlike studies that rely on vast amounts of data, CodeOcean achieves high performance with a smaller dataset.

To demonstrate the effectiveness of CodeOcean, researchers fine-tuned three coding language models: StarCoder-15B, CodeLLaMA (7B and 13B), and DeepseekCoder-6.7B. Given the size of the dataset, fine-tuning was both fast and cost-efficient. The researchers evaluated the fine-tuned models against three key coding benchmarks: HumanEval, MBPP, and HumanEvalPack.
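
For a feel of what this step involves, below is a minimal sketch of instruction fine-tuning on a CodeOcean-style dataset with Hugging Face transformers. It is not the authors' training code; the prompt template, hyperparameters, and the tiny stand-in dataset are assumptions for illustration.

```python
# Minimal sketch: instruction fine-tuning a code LLM on CodeOcean-style examples.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "deepseek-ai/deepseek-coder-6.7b-base"   # one of the bases used in the paper
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Stand-in for the 20,000 CodeOcean instruction samples.
codeocean_examples = [
    {"instruction": "Write a Python function that returns the factorial of n.",
     "output": "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"},
]

def format_example(ex):
    # Simple instruction/response template (assumed, not the paper's exact format).
    text = (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=2048)

train_ds = Dataset.from_list(codeocean_examples).map(
    format_example, remove_columns=["instruction", "output"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wavecoder-sft", num_train_epochs=3,
                           per_device_train_batch_size=2, gradient_accumulation_steps=16,
                           learning_rate=2e-5, bf16=True, logging_steps=10),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```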

With a few epochs of training on CodeOcean, all models showed significant improvements on these benchmarks. On code generation, the researchers describe the impact and limitations of WaveCoder: “Following the fine-tuning process, the performance of our models exhibit substantial improvement when compared to both the foundation model and a selection of open-source models, but it still lags behind proprietary models [like GPT-4 and Gemini] [and] the instructed models training with more than 70K training data.”

The performance gap between WaveCoder and WizardCoder, which was trained on 78,000 examples, is small. This suggests that “refined and diverse instruction data can significantly improve the efficiency of instruction tuning.”

WaveCoder was particularly superior in code summarization and repair tasks. It outperformed other open-source models across nearly all programming languages. This success emphasizes “the effectiveness of defining and classifying code-related tasks on enhancing the generalization ability of Code LLMs.”

Microsoft has not yet released the model, code, and data for WaveCoder and CodeOcean. But discussions on Hugging Face indicate that Microsoft is reviewing them for possible release. Looking ahead, the researchers aim to explore the effect of larger datasets, as well as the potential benefits of combining CodeOcean with other coding datasets.
