This article is part of our coverage of the latest in AI research.
The transformer has been one of the most influential machine learning architectures in recent years. It underlies some of the most advanced deep learning systems, including large language models such as OpenAI’s GPT-3 and protein structure predictors such as DeepMind’s AlphaFold.
The transformer architecture owes its success to its powerful attention mechanism, which enables it to outperform its predecessors, the RNN and LSTM. Transformer models can process long sequences of data in parallel and in both forward and reverse directions.
Given the importance of transformer networks, there are several efforts to improve their accuracy and efficiency. One of these initiatives is a new research project by scientists at the University of Cambridge, the University of Oxford, and Imperial College London, which suggests changing the transformer architecture from deep to wide. Although the architectural change is small, the results show that it significantly improves the speed, memory footprint, and interpretability of transformer networks.
Improving the transformer architecture
The original transformer architecture, introduced in 2017, is composed of an encoder and decoder module, which use similar components. Later, other variations of the transformer were introduced, some of which use only the encoder or the decoder part. For example, BERT is an encoder-only transformer model while GPT-3 is a decoder-only network.
Consider an encoder-only transformer model that classifies reviews for movies or products as positive or negative. The input text is first transformed into an embedding with positional encoding. Embeddings are multi-dimensional numerical representations of words. Therefore, a string of text becomes an array of multi-dimensional vectors. Positional encoding modifies the embedding values to account for the position of each word in the sequence.
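The embedding step above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the paper: it assumes the sinusoidal positional encoding from the original 2017 transformer, and the function names and embedding values are made up for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sine on even indices, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to each word embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(word, pos)]
            for word, pos in zip(embeddings, pe)]

# Toy 4-dimensional embeddings for a three-word review (made-up values).
# The first and last word are identical, so their embeddings match.
embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.1, 0.2, 0.3, 0.4],
]
encoded = add_positions(embeddings)
```

After this step, the first and third rows of `encoded` differ even though the underlying word is the same, which is how the model can tell positions apart.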
These values are fed to the attention layer, the main building block of transformers. The attention layer is composed of several attention heads. During the training phase, each attention head configures its parameters to capture relations between different inputs. The output can then be flattened and fed to one or more fully connected layers and finally turned into a binary classification output.
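To make the idea of an attention layer with several heads more concrete, here is a rough sketch in plain Python. It assumes standard scaled dot-product attention, and the random matrices stand in for parameters that a real model would learn during training. The helper names (`attention_head`, `multi_head_attention`) are hypothetical, and the final projection and classification layers are left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, wq, wk, wv):
    """One head: project to queries/keys/values, then mix values by similarity."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # one score per (query, key) pair
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    # Stand-in for trained parameters.
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, d_head, rng):
    """Run every head on the same input and concatenate their outputs per token."""
    d_model = len(x[0])
    head_outputs = []
    for _ in range(num_heads):
        wq, wk, wv = (random_matrix(d_model, d_head, rng) for _ in range(3))
        out, _ = attention_head(x, wq, wk, wv)
        head_outputs.append(out)
    return [sum((head[t] for head in head_outputs), [])
            for t in range(len(x))]

rng = random.Random(0)
x = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
out = multi_head_attention(x, num_heads=2, d_head=3, rng=rng)
```

Each head reads the same input independently, so nothing in the loop over heads depends on an earlier iteration. The `weights` matrix returned by `attention_head` is what one would inspect to see which inputs a head attends to.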
Previous attempts to improve transformers mainly focused on creating new attention mechanisms that specialize in specific tasks. The scientists at the University of Cambridge, the University of Oxford, and Imperial College London asked a different question: instead of changing the attention mechanism, why not rethink the general architecture of the transformer? The result is a new technique that improves the performance of transformers while being agnostic to both the task and the attention mechanism.
“We were originally researching different attention mechanisms, and whether you could combine different attentions for performance gains,” Jason Brown, co-author of the paper and engineering student at the University of Cambridge, told TechTalks.
As part of their search experiments, the researchers created a single-layer transformer model that combined many different attention heads. To their surprise, the model performed unexpectedly well despite its smaller overall size.
“Exploring the reasons behind this, we found it was due to it having the same overall amount of computation in the attention, whilst having only a single layer,” Brown said. “As Transformers are very expensive to train and run, being able to make them more efficient whilst retaining accuracy was an exciting prospect.”
Deep versus wide transformer models
As with most deep learning architectures, the learning capacity of transformer models increases as they become deeper. By stacking several attention layers on top of each other, you can enable the transformer network to learn more complex representations of the input space.
However, adding attention layers comes with several tradeoffs. First, each layer increases the memory footprint of the neural network. Second, each layer increases the model’s latency, because layers must be processed one after another. And third, more layers make the model less interpretable, because it becomes harder to relate outputs to specific input points.
The idea that Brown and his co-authors propose is to swap deep networks for wide ones. Instead of adding attention layers to your network, you add attention heads to your attention layer. The idea is very simple but happens to have a profound effect on the performance of the transformer.
For example, consider a transformer model composed of six attention layers, each of which has eight attention heads, for a total of 48 heads. Using the wide-network approach, you can change the architecture to a single attention layer with 48 attention heads, two attention layers with 24 attention heads each, or three layers with 16 attention heads each.
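The arithmetic behind these alternatives is just a matter of keeping the total number of heads fixed. The small helper below (a hypothetical name, and it assumes heads are split evenly across layers) lists every layout that is no deeper than the original model.

```python
def wide_configurations(layers, heads_per_layer):
    """List (layers, heads per layer) pairs that keep the same total head count."""
    total_heads = layers * heads_per_layer
    return [(n, total_heads // n)
            for n in range(1, layers + 1)
            if total_heads % n == 0]  # only even splits across layers

for n_layers, n_heads in wide_configurations(6, 8):
    print(f"{n_layers} layer(s) x {n_heads} heads")
```

For the six-layer, eight-head model, this yields the three layouts mentioned above, plus four layers of 12 heads and the original six layers of eight.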
The benefits of wide transformers
There are several benefits to this approach. First, while deep and wide transformers have the same number of attention heads, the wide network has fewer parameters because it removes the dense layers that connect each attention layer to the next. In at least one case, the researchers were able to reduce the model to 48 percent of its original size by switching from a deep to a wide architecture. Other configurations also yielded substantial memory savings.
The second benefit is speed. With attention heads processing the input in parallel instead of sequentially, the model has lower latency and responds faster. The researchers measured speedups of 3.1x on CPUs and 1.9x on GPUs.
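A back-of-the-envelope count shows where the savings come from. The sketch below is not the paper's accounting: it assumes every head keeps the same dimension, that each layer has one two-matrix feed-forward block, and it ignores biases, embeddings, layer normalization, and the classifier. All dimensions are made up for illustration.

```python
def attention_params(total_heads, d_model, d_head):
    """Query, key, and value projections plus each head's slice of the output projection."""
    return total_heads * 4 * d_model * d_head

def feed_forward_params(d_model, d_ff):
    """Two dense matrices: d_model -> d_ff and d_ff -> d_model."""
    return 2 * d_model * d_ff

def transformer_params(layers, heads_per_layer, d_model, d_head, d_ff):
    """Attention weights scale with total heads; dense blocks scale with depth."""
    return (attention_params(layers * heads_per_layer, d_model, d_head)
            + layers * feed_forward_params(d_model, d_ff))

# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
deep = transformer_params(layers=6, heads_per_layer=8, d_model=128, d_head=16, d_ff=512)
wide = transformer_params(layers=1, heads_per_layer=48, d_model=128, d_head=16, d_ff=512)
print(deep, wide, round(wide / deep, 2))  # 1179648 524288 0.44
```

In this toy count, the attention parameters are identical in both models, and the entire difference comes from keeping one dense block instead of six. The resulting ratio is a product of the made-up dimensions and should not be read as the 48 percent figure the researchers measured.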
Finally, wide networks are more interpretable than deep transformers because you can directly associate attention head features to input as opposed to going through several layers. “In Transformer-based architectures, the attention mechanism can be inspected for a given output to see what connections between input features each head in each layer found important. For deep networks this process must be performed for each layer and it can often become unclear what the final output was actually,” the researchers write. “In the case of a single layer wide network, interpretability is far easier as only one layer needs to be inspected, and what was considered important for the final output is clearer.”
These improvements could make it possible to run transformers on edge devices that have limited resources and need real-time inference.
Still more to learn about wide transformer models
According to the findings of Brown and his co-authors, switching from deep to wide transformers not only preserves performance but in some cases results in an improvement in accuracy.
“On average, wider Transformer networks outperform deep ones. This result holds for both the ‘vanilla’ Transformer with dot product attention, and in general for many other types of attention,” the researchers write.
The tests have so far been limited, however. The authors tested the wide transformer architecture on four text classification tasks, while transformers have many more applications, especially in language modeling and text generation. The tests were also limited to transformer models with six attention layers of eight attention heads each. This is very small in comparison to large language models such as GPT-3, which have dozens of attention layers, each with dozens of attention heads, and more than a hundred dimensions per head.
Even so, wide transformer models appear to be a very promising direction of research.
“We wanted to test on larger models and on other domains such as language modeling and translation, but we were constrained on time and wanted to publish what we had found so far,” Brown said. “Large transformer models, particularly those used for language modeling like BERT or GPT3, are very expensive to train from scratch due to the large numbers of parameters and training data. We hope future research will explore these directions, and that our overall understanding of the transformer architecture improves.”