This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
DeepMind is the latest AI research lab to introduce a deep learning model that can generate software source code with remarkable results. Called AlphaCode, the model is based on Transformers, the same architecture OpenAI uses in its code-generation models.
Programming is one of the promising applications of deep learning and large language models. The growing demand for programming talent has spurred a race to create tools that can make developers more productive and give non-developers tools to create software.
And in this regard, AlphaCode surely impresses. It has managed to solve complicated programming challenges that typically require hours of planning, coding, and testing. It might one day become a good tool to turn problem descriptions into working code.
But it certainly isn’t the equivalent of a human programmer of any level. It’s a totally different approach to creating software, one that isn’t complete without human thinking and intuition.
AlphaCode is not the only game in town, but it accomplishes a very complicated task. Other similar systems focus on generating short code snippets, such as a function or a block of code that performs a small task (e.g., set up a web server, pull information from an API system). While impressive feats, such tasks become trivial when the language model has been exposed to a large enough corpus of source code.
AlphaCode, on the other hand, aims to solve competitive programming problems. Participants in coding challenges must read the challenge description, understand the problem, translate it into an algorithmic solution, implement it in a general-purpose language, and evaluate it against a limited set of test cases. Finally, their results are evaluated based on performance on hidden tests that weren’t available during implementation. A coding challenge can have other conditions such as timing and memory constraints.
Basically, a machine learning model that participates in coding challenges must generate an entire program that solves a problem unlike anything it has seen before. This is much more difficult than synthesizing a source code excerpt based on previously seen examples.
The power of transformers and large language models
AlphaCode is yet another example of how large language models have advanced in solving complicated problems. This kind of deep learning system is generally known as a sequence-to-sequence model (seq2seq). Seq2seq algorithms take a sequence of values (letters, pixels, numbers, etc.) as input and produce another sequence of values. This is the approach used in many natural language tasks such as machine translation, text generation, and speech recognition.
According to DeepMind’s paper, AlphaCode uses an encoder-decoder Transformer architecture. Transformers have become especially popular in recent years because they process the elements of a sequence in parallel and handle long-range dependencies better than their predecessors, recurrent neural networks (RNN) and long short-term memory networks (LSTM).
The encoder part of AlphaCode creates a numerical representation of the natural language description of the problem. The decoder part takes the embedding vector produced by the encoder and tries to generate the source code of the solution.
Transformer models have proven to be good at such tasks, especially when they are provided with enough training data and computing power. But more than the sheer power of throwing raw data at super-large neural networks, the real brilliance of AlphaCode, in my opinion, has more to do with the ingenuity of DeepMind’s scientists in designing the training process and the algorithm for generating and filtering its results.
Unsupervised and supervised learning
To create AlphaCode, DeepMind’s scientists used a combination of unsupervised pretraining and supervised fine-tuning. The pretraining stage, often referred to as self-supervised learning, has become popular for applications where there isn’t enough labeled data or where data annotation is expensive and time-consuming.
In the pretraining phase, AlphaCode went through unsupervised learning on 715 gigabytes of data extracted from GitHub. The model is trained by trying to predict the missing parts of a language or code snippet. The advantage of this method is that it doesn’t require any kind of annotation, and by being exposed to more and more samples, the ML model gradually becomes better at creating numerical representations for the structure of text and source code.
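To make the idea of "predicting the missing parts" concrete, here is a minimal sketch of how a training example can be built from unlabeled code. It is an illustration only, not DeepMind's actual objective: the whitespace tokenizer, the `<MASK>` token, and the `make_masked_example` helper are assumptions made for this example.

```python
import random

def make_masked_example(tokens, mask_ratio=0.15, mask_token="<MASK>", seed=0):
    """Hide a random subset of tokens and return (masked input, targets).

    Hypothetical helper: the hidden tokens act as labels, so no human
    annotation is needed.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked_input = [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    # The model would be trained to recover these tokens from the context.
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return masked_input, targets

# Naive whitespace tokenization, used here only to keep the sketch short.
snippet = "def add ( a , b ) : return a + b"
tokens = snippet.split()
masked_input, targets = make_masked_example(tokens, mask_ratio=0.25)

print(" ".join(masked_input))
print(targets)
```

The point of the sketch is that the "labels" come from the data itself, which is why this kind of training scales to hundreds of gigabytes of code without annotation.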
The pretrained model is then fine-tuned on CodeContests, an annotated dataset created by the DeepMind team. The dataset contains problem statements, correct and incorrect submissions, and test cases collected from various sources, including Codeforces, Description2Code, and IBM’s CodeNet. The model is trained to transform the textual description of the challenge into the resulting source code. Its results are evaluated with test cases and compared to the correct submissions.
When creating the dataset, the researchers took extra care to avoid historic overlaps between the training, validation, and test sets. This made sure that the ML model would not generate memorized results when faced with coding challenges.
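One common way to avoid this kind of overlap is to split the data by publication date, so that everything the model trains on predates the problems it is evaluated on. The sketch below assumes that approach; the problem IDs, dates, and the `temporal_split` helper are hypothetical and are not taken from CodeContests.

```python
from datetime import date

# Hypothetical problem records; real datasets carry far more fields.
problems = [
    {"id": "A", "published": date(2019, 5, 1)},
    {"id": "B", "published": date(2020, 11, 3)},
    {"id": "C", "published": date(2021, 8, 15)},
    {"id": "D", "published": date(2021, 10, 2)},
]

def temporal_split(problems, valid_start, test_start):
    """Split so that training data strictly predates validation and test data."""
    train = [p for p in problems if p["published"] < valid_start]
    valid = [p for p in problems if valid_start <= p["published"] < test_start]
    test = [p for p in problems if p["published"] >= test_start]
    return train, valid, test

# Cutoff dates are assumptions chosen for this example.
train, valid, test = temporal_split(
    problems,
    valid_start=date(2021, 7, 1),
    test_start=date(2021, 9, 1),
)

print([p["id"] for p in train], [p["id"] for p in valid], [p["id"] for p in test])
```

With a split like this, a solution to a test problem cannot have leaked into the training set simply because it did not exist yet when the training data was collected.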
Code generation and filtering
Once AlphaCode was trained, it was tested against problems it hadn’t seen before. When AlphaCode processes a new problem, it generates many solutions. It then uses a filtering algorithm to select the best 10 candidates and submits them to the competition. If at least one of them is correct, then the problem is considered solved.
According to DeepMind’s paper, AlphaCode can generate millions of samples per problem, though it usually generates thousands of solutions. The samples are then filtered to only include those that pass the tests included in the problem statement. This removes approximately 99 percent of the generated samples, according to the paper. But this still leaves thousands of valid samples.
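The filtering step can be pictured with a toy example. The sketch below is a simplification under stated assumptions: the task (squaring a number), the candidate programs, and the `passes_examples` helper are invented for illustration, and a real system would run untrusted generated code in a sandbox rather than with a bare `exec`.

```python
def passes_examples(source, examples):
    """Return True if the candidate defines solve() and matches every example."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; fine for a toy sketch
        solve = namespace["solve"]
        return all(solve(inp) == expected for inp, expected in examples)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures.
        return False

# Example tests as they might appear in a problem statement (hypothetical).
examples = [(2, 4), (3, 9)]

# Stand-ins for model-generated samples.
candidates = [
    "def solve(n):\n    return n * 2",   # passes the first example only
    "def solve(n):\n    return n ** 2",  # correct
    "def solve(n) return n",             # does not even parse
    "def solve(n):\n    return n + 2",   # passes the first example only
    "def solve(n):\n    return n * n",   # correct
]

survivors = [c for c in candidates if passes_examples(c, examples)]
print(f"{len(survivors)} of {len(candidates)} candidates pass the example tests")
```

Note that passing the example tests does not guarantee passing the hidden tests, which is why a further selection step is still needed.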
To optimize the sample-selection process, a clustering algorithm is used to divide the solutions into groups. According to the researchers, the clustering process tends to group the working solutions together. This makes it much easier to find a small set of candidates that are likely to pass the hidden tests of the competition.
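A rough way to picture this step: group the surviving programs by how they behave on some extra inputs, then submit one program from each of the largest groups. The sketch below assumes that grouping criterion, which this article does not spell out; the candidate functions, the probe inputs, and both helpers are hypothetical.

```python
from collections import defaultdict

def cluster_by_behavior(programs, probe_inputs):
    """Group programs that return identical outputs on every probe input."""
    clusters = defaultdict(list)
    for name, fn in programs.items():
        signature = tuple(fn(x) for x in probe_inputs)
        clusters[signature].append(name)
    # Largest clusters first; ties keep their original order.
    return sorted(clusters.values(), key=len, reverse=True)

def pick_submissions(clusters, limit=10):
    """Take one representative from each of the largest clusters."""
    return [cluster[0] for cluster in clusters[:limit]]

# Toy candidates for "square a number". All of them agree on small
# positive inputs such as 2 and 3, so example tests alone cannot separate them.
programs = {
    "square_pow": lambda n: n ** 2,
    "square_mul": lambda n: n * n,
    "square_abs": lambda n: abs(n) * abs(n),
    "wrong_for_negatives": lambda n: n * abs(n),
    "wrong_for_large": lambda n: n * n if n < 100 else 0,
}
probe_inputs = [-3, 0, 5, 1000]  # assumed extra inputs, not from the article

clusters = cluster_by_behavior(programs, probe_inputs)
print(clusters)
print(pick_submissions(clusters, limit=2))
```

In this toy case the three correct programs land in the same cluster while each buggy program fails in its own way, so picking from the largest cluster first favors a working solution.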
According to DeepMind, when tested on actual programming competitions on the popular Codeforces platform, AlphaCode ranked among the top 54 percent of participants on average, which is very impressive given the difficulty of coding challenges.
AI vs humans
DeepMind’s blog rightly states that AlphaCode is the first AI code generation system that has “reached a competitive level of performance in programming competitions.”
However, some publications have mistaken this claim for AI coding being “as good as human programmers.” This is the fallacy of comparing narrow AI with the general problem-solving capabilities of humans.
For example, in general, you can expect a person who excels at chess and Go to be smart in many other ways. In fact, you must acquire many other cognitive skills before you can learn and master chess. However, the past decades have proven that an AI system can shortcut its way to solving very difficult problems without acquiring any of those other skills.
Two prime examples are Deep Blue and AlphaGo, the AI systems that beat the world champions at chess and Go. While both systems were terrific achievements of computer science and artificial intelligence, they only excelled at one task. They could not compete with their human opponents at any other task that required careful planning and strategizing, skills that those humans had acquired before becoming chess and Go masters.
The same thing can be said about competitive programming. A human programmer who reaches a competitive level in coding challenges has spent years studying. They can think abstractly about problems, solve much simpler challenges, write simple programs, and manifest many other skills that are taken for granted and are not evaluated in the programming competition.
In a nutshell, these competitions have been designed for humans. You can be sure that in general, a person who ranks high in competitive programming is a good programmer. This is why many companies use these challenges to make hiring decisions.
AlphaCode, on the other hand, is a shortcut for competitive programming—albeit a brilliant one. It creates novel code. It doesn’t copy-paste from its training data. But it is not the equivalent of an average programmer.
Human programmers use their intuition to direct their limited computing resources in the direction of the right solution. They use debugging, analysis, and review to refine their code. In contrast, AlphaCode generates thousands of samples—sometimes up to 100,000—and filters them to find the ones that work.
As computer science professor Ernest Davis observes, “There is a substantial component of monkeys typing Hamlet going on here. AlphaCode has succeeded in training the monkeys to a remarkable degree, but still they need a lot of them. It then produces 10 candidates, and considers it a success if one of those is correct.”
This is a reference to the infinite monkey theorem, which states that “a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text,” including Shakespeare’s Hamlet.
This is not an attack against AlphaCode. In fact, AlphaCode proves that with ingenious design, enough computing power, and large amounts of data, you can create an AI system that can search a vast solution space that would be impossible to explore through brute-force computation (this is also what DeepMind did with AlphaGo).
However, we must also acknowledge the limits of this approach. First, as Davis notes, the problem becomes dramatically harder as the solution becomes longer. “AlphaCode requires 1 million samples to get 34% correct on 20-line programs; to produce a 200 line program—the length of a standard assignment in a second-year computer science class—it might well require 10^60 samples,” he writes.
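One way to read the 10^60 figure is as a back-of-the-envelope calculation. The sketch below assumes that a 200-line program behaves like ten independent 20-line pieces, each needing about a million samples, so the sample counts multiply. That independence assumption is my interpretation and is not stated in the quote.

```python
samples_for_20_lines = 10 ** 6  # figure quoted by Davis
short_length = 20
long_length = 200

# Assumption: the long program is ten independent 20-line segments.
segments = long_length // short_length

# If every segment must be right at once, the required samples multiply.
samples_for_200_lines = samples_for_20_lines ** segments

exponent = len(str(samples_for_200_lines)) - 1
print(f"{segments} segments -> about 10^{exponent} samples")
```

Under this reading, every additional 20 lines multiplies the required number of samples by a million, which is why sampling and filtering alone does not scale to longer programs.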
Second, AlphaCode explicitly requires well-formulated problem statements and test cases to evaluate and filter the thousands of samples it generates. “Now, there is no question that having inputs and outputs provided is enormously useful for the human contestants in programming competitions,” Davis writes. “Nonetheless, if they were not provided, human programmers could in most cases succeed, with a little more work. By contrast, AlphaCode would be completely at a loss without the specific examples provided; the success rate would drop by a factor of about 100.”
Therefore, instead of pitting AlphaCode against human programmers, we should be more interested in what AlphaCode and other similar AI systems can do when teamed up with human programmers. Such tools can have a tremendous impact on the productivity of human programmers. They might even bring changes to the culture of programming, shifting humans toward formulating problems (a discipline that is still the domain of human intelligence) and having AI systems generate the code.
But human programmers will still be in control. They will have to learn to harness the power of AI-generated code and understand its limits.
AlphaCode should be recognized for what it is: a code generator that can propose good candidate solutions for well-formulated problem statements. It should also be recognized for what it is not: the digital equivalent of a human programmer.