Deep neural networks have a huge advantage: They replace “feature engineering”—a difficult and arduous part of the classic machine learning cycle—with an end-to-end process that automatically learns to extract features.
However, finding the right deep learning architecture for your application can be challenging. There are numerous ways you can structure and configure a neural network, using different layer types and sizes, activation functions, and operations. Each architecture has its strengths and weaknesses. And depending on the application and environment in which you want to deploy your neural networks, you might have special requirements, such as memory and computational constraints.
The classic way to find a suitable deep learning architecture is to start with a model that looks promising and gradually modify it and train it until you find the best configuration. However, this can take a long time, given the numerous configurations and amount of time that each round of training and testing can take.
An alternative to manual design is “neural architecture search” (NAS), a series of machine learning techniques that can help discover optimal neural networks for a given problem. Neural architecture search is a big area of research and holds a lot of promise for future applications of deep learning.
Search spaces for deep learning
While there are numerous neural architecture search techniques, they mostly have a few things in common. In a paper titled “Neural Architecture Search: A Survey,” researchers at Bosch and the University of Freiburg break down NAS into three components: search space, search strategy, and performance estimation strategy.
The first part of a NAS strategy is to define the search space for the target neural network. The basic component of any deep learning model is the neural layer. You can determine the number and type of layers to explore. For example, you might want your deep learning model to be composed of a range of convolutional (CNN) and fully connected layers. You can also determine the configurations of the layers, such as the number of units, or in the case of CNNs, kernel size, number of filters, and stride. Other elements can be included in the search space such as activation functions and operations (pooling layers, dropout layers, etc.).
The more elements you add to the search space, the more versatility you get. But naturally, added degrees of freedom expand the search space and increase the costs of finding the optimal deep learning architecture.
More advanced architectures usually have several branches of layers and other elements. For example, ResNet, a popular image recognition model, uses skip connections, where one layer’s output is provided not only to the next layer but also to layers that are further down the stream. These types of architectures are more difficult to explore with NAS because they have more moving parts.
One of the techniques that help reduce the complexity of the search space while maintaining the complexity of the neural network architecture is the use of “cells.” In this case, the NAS algorithm can optimize small blocks separately and then use them in combination. For example, VGGNet, another famous image recognition network, is composed of repeating blocks composed of a convolution layer, an activation function, and a pooling layer. The NAS algorithm can optimize the block separately and then find the best configuration of blocks in a large network.
Even basic search spaces usually require plenty of trials and error to find the optimal deep learning architecture. Therefore, a neural architecture search algorithm also needs a “search strategy.” The search strategy determines how the NAS algorithm experiments with different neural networks.
The most basic strategy is “random search,” in which the NAS algorithm randomly selects a neural network from the search space, trains and validates it, registers the results, and moves on to the next. Random search is extremely expensive because the NAS algorithm is basically brute-forcing its way through the search space, wasting expensive resources on testing solutions that can be eliminated with easier methods. Depending on the complexity of the search space, random search can take days’ or weeks’ worth of GPU time to verify every possible neural network architecture.
There are other techniques that speed up the search process. An example is Bayesian optimization, which starts with random choices and gradually tunes its search direction as it gathers information about the performance of different architectures.
Another strategy is to frame neural architecture search as a reinforcement learning problem. In this case, the RL agent’s environment is the search space, the actions are the different configurations of the neural network, and the reward is the performance of the network. The reinforcement learning agent starts with random modifications, but over time, it learns to choose configurations that produce better improvements to the neural network’s performance.
Other search strategies include evolutionary algorithms and Monte Carlo tree search. Each search strategy has its strengths and weaknesses, and engineers must find the right balance between “exploration and exploitation,” which basically means testing totally new architectures or tweaking the ones that have so far proven promising.
Performance estimation strategy
As the NAS algorithm goes through the search space, it must train and validate deep learning models to compare their performance and choose the optimal neural network. Obviously, doing full training on each neural network takes a long time and requires very large computational resources.
To reduce the costs of evaluating deep learning models, engineers of NAS algorithms use “proxy metrics” that can be measured without requiring full training of the neural network.
For example, they can train their models for fewer epochs, on a smaller dataset, or on lower resolution data. While the resulting deep learning model will not reach its full potential, these lower fidelity training regimes provide a baseline to compare different models at a lower cost. Once the set of architectures has been culled to a few promising neural networks, the NAS algorithm can do more thorough training and testing of the models.
Another way to reduce the costs of performance estimation is to initialize new models on the weights of previously trained models. Known as transfer learning, this practice results in a much faster convergence, which means the deep learning model will need fewer training epochs. Transfer learning is applicable when the source and destination model have compatible architectures.
A work in progress
Neural architecture search still has challenges to overcome, such as providing explanations of why some architectures are better than others and addressing complicated applications that go beyond simple image classification.
NAS is nonetheless a very useful and attractive field for the deep learning community and can have great applications both for academic research and applied machine learning.
Sometimes, technologies like NAS are depicted as artificial intelligence that creates its own AI, making humans redundant and taking us toward AI singularity. But in reality, NAS is a perfect example of how humans and contemporary AI systems can work together to solve complicated problems. Humans use their intuition, knowledge, and common sense to find interesting problem spaces and define their boundaries and intended result. NAS algorithms, on the other hand, are very efficient problem solvers that can search the solution space and find the best neural network architecture for the intended application.