This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
In the 1980s, Geoffrey Hinton was one of the scientists who invented backpropagation, the algorithm that enables the training of deep neural networks. Backpropagation was key to the success of deep learning and its widespread use today.
But Hinton, who is one of the most celebrated artificial intelligence scientists of our time, thinks it is time to look beyond backpropagation for other, more efficient ways to train neural networks. And like many other scientists, he draws inspiration from the human brain.
In a new paper presented at NeurIPS 2022, Hinton introduced the “forward-forward algorithm,” a new learning algorithm for artificial neural networks inspired by our knowledge about neural activations in the brain. Though still in early experimentation, forward-forward has the potential to replace backprop in the future, Hinton believes. The paper also proposes a new model for “mortal computation,” which brings us closer to the energy-efficient mechanisms of the brain and can support the forward-forward algorithm.
The problem with backpropagation
When a deep neural network is in training, it goes through two phases. First is the forward pass, in which the input features go through the layers of the model and are modified by the weights, activation functions, and other transformations to create a prediction. The prediction is then compared to the ground truth and the error is calculated.
In the second phase, backpropagation, the error is propagated in reverse order through the layers and the weights of artificial neurons are corrected. Backpropagation uses the partial derivatives and the chain rule to calculate gradients and adjust each artificial neuron based on its contribution to the error.
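To make the two phases concrete, here is a minimal sketch in plain Python. The tiny one-input, one-hidden-unit network, the tanh activation, the squared-error loss, and all function names are assumptions chosen for illustration; they are not taken from Hinton's paper.

```python
import math

def forward(x, w1, w2):
    """Forward pass of a tiny network: x -> tanh(w1 * x) -> w2 * h."""
    h = math.tanh(w1 * x)   # hidden activation
    y_hat = w2 * h          # prediction
    return h, y_hat

def backward(x, y, h, y_hat, w2):
    """Backward pass: chain rule from the squared error back to each weight."""
    d_loss_d_yhat = 2.0 * (y_hat - y)              # derivative of (y_hat - y)^2
    d_loss_d_w2 = d_loss_d_yhat * h                # because y_hat = w2 * h
    d_loss_d_h = d_loss_d_yhat * w2                # error reaching the hidden unit
    d_loss_d_w1 = d_loss_d_h * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    return d_loss_d_w1, d_loss_d_w2

def train(x, y, w1=0.5, w2=0.5, lr=0.1, steps=50):
    """Alternate forward and backward passes, recording the loss at each step."""
    losses = []
    for _ in range(steps):
        h, y_hat = forward(x, w1, w2)
        losses.append((y_hat - y) ** 2)
        g1, g2 = backward(x, y, h, y_hat, w2)
        w1 -= lr * g1   # each weight moves against its own gradient
        w2 -= lr * g2
    return w1, w2, losses
```

Note that the backward pass needs the hidden activation `h` stored during the forward pass, which is the kind of bookkeeping Hinton argues the cortex is unlikely to perform.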
While backpropagation is extremely useful, it is very different from what we know about the brain.
“As a model of how cortex learns, backpropagation remains implausible despite considerable effort to invent ways in which it could be implemented by real neurons,” Hinton writes. “There is no convincing evidence that cortex explicitly propagates error derivatives or stores neural activities for use in a subsequent backward pass.”
For example, in the visual system, the connections between different cortical areas do not mirror the bottom-up connections of backpropagation-based deep learning models. Instead, they go in loops, in which neural signals traverse several cortical layers and then return to earlier areas.
One of the main limits of backpropagation is the detachment of learning and inference. To adjust the weights of a neural network, the training algorithm must stop inference to perform backpropagation. In the real world, the brain receives a constant stream of information, and “the perceptual system needs to perform inference and learning in real time without stopping to perform backpropagation,” Hinton writes.
Backpropagation also doesn’t work if the computation done in the forward pass is not differentiable. “If we insert a black box into the forward pass, it is no longer possible to perform backpropagation unless we learn a differentiable model of the black box,” Hinton writes.
Where backpropagation is not possible, reinforcement learning can serve as an alternative algorithm for training neural networks. But reinforcement learning is computationally expensive, slow, and unstable when the neural network contains millions or billions of parameters.
The forward-forward algorithm
The idea behind the forward-forward algorithm is to replace the forward and backward passes of backpropagation with two forward passes. The two passes are similar, but they work on different data and have opposite objectives.
The “positive pass” operates on real data and adjusts the weights of the network to increase the “goodness” of each layer. The “negative pass” operates on negative data and adjusts the weights to reduce goodness.
In the paper, Hinton explores two measures of goodness: the sum of squared neural activities and, as an alternative, the negative sum of squared activities. With the first measure, learning aims to adjust the neural network’s weights so that goodness is above a certain threshold for real examples and below the threshold for negative data.
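The following sketch shows one way a single layer could be trained against such a threshold. It assumes goodness is the sum of squared ReLU activities and uses a logistic loss on the gap between goodness and the threshold; the function names, the threshold `theta`, the learning rate, and the plain gradient step are illustrative assumptions, not the paper's exact recipe.

```python
import math

def relu_layer(weights, x):
    """Activities of one fully connected ReLU layer (weights is a list of rows)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def goodness(activities):
    """Goodness measured as the sum of squared activities."""
    return sum(a * a for a in activities)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_update(weights, x, positive, theta=1.0, lr=0.05):
    """One layer-local step: raise goodness for positive data, lower it for
    negative data. Only this layer's own activities are used; nothing is
    propagated to or from other layers. Returns the goodness before the step."""
    y = relu_layer(weights, x)
    g = goodness(y)
    # Assumed logistic loss on (g - theta); this is its derivative w.r.t. g.
    if positive:
        d_loss_d_g = -sigmoid(theta - g)   # loss = log(1 + exp(theta - g))
    else:
        d_loss_d_g = sigmoid(g - theta)    # loss = log(1 + exp(g - theta))
    for j, row in enumerate(weights):
        for i, xi in enumerate(x):
            # d g / d w_ji = 2 * y_j * x_i, which is zero for inactive units
            row[i] -= lr * d_loss_d_g * 2.0 * y[j] * xi
    return g
```

Calling `local_update(weights, x, positive=True)` repeatedly pushes the layer's goodness for `x` upward, and `positive=False` pushes it downward. The derivative used here is local to the layer, so no error signal travels backward through the network.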
During training, the examples and their labels are merged into a single vector to create positive data. To create negative data, the same process is used with the slight difference that a false label is attached to the example. After the data is run through the neural network, the weights are adjusted to increase the goodness (e.g., the sum of squared activations) for the positive examples and decrease it for the negative examples. No backpropagation is required.
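A minimal sketch of this data construction follows. It assumes the label is merged by overwriting the first `num_classes` entries of the input vector with a one-hot encoding, and that the false label is drawn uniformly from the wrong classes; the helper names and the sampling choice are illustrative assumptions.

```python
import random

def embed_label(features, label, num_classes):
    """Merge an example and a label into one vector by overwriting the
    first num_classes entries with a one-hot encoding of the label."""
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    return one_hot + list(features[num_classes:])

def make_positive(features, true_label, num_classes):
    """Positive data: the example combined with its correct label."""
    return embed_label(features, true_label, num_classes)

def make_negative(features, true_label, num_classes, rng=random):
    """Negative data: the same example combined with a randomly chosen wrong label."""
    wrong = rng.choice([k for k in range(num_classes) if k != true_label])
    return embed_label(features, wrong, num_classes)
```

With this construction, a positive and a negative vector built from the same example differ only in the label entries, so the network can only tell them apart by relating the label to the rest of the input.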
This process works well for a neural network with a single hidden layer. For a multi-layer deep learning model, the output of each hidden layer is normalized before being passed on to the next one.
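One plausible form of this normalization, sketched below, rescales each hidden vector to unit length. The function name and the small `eps` constant are assumptions for illustration. The rationale is that after rescaling, the sum of squared activities is roughly the same for every input, so the next layer cannot simply read off the previous layer's goodness and has to rely on the relative pattern of activities.

```python
import math

def normalize_length(activities, eps=1e-8):
    """Rescale a hidden vector to (approximately) unit Euclidean length.
    eps guards against division by zero when every activity is zero."""
    norm = math.sqrt(sum(a * a for a in activities))
    return [a / (norm + eps) for a in activities]
```

For example, `normalize_length([3.0, 4.0])` and `normalize_length([30.0, 40.0])` yield nearly identical vectors even though the second input has 100 times the goodness of the first.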
The FF algorithm can also benefit from some form of self-supervised learning that can make the neural networks more robust while reducing the need for manually labeled examples. In the paper, Hinton shows how simple masking techniques can create negative examples to train the model.
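As a rough illustration of mask-based negative data, the sketch below mixes two real examples element by element using a random binary mask. This is a deliberately simplified stand-in: the per-element random mask, the 50 percent mixing probability, and the function name are assumptions, and a practical mask would more likely cover larger contiguous regions.

```python
import random

def hybrid_negative(example_a, example_b, rng=random):
    """Build a negative example by taking each element from example_a where a
    random binary mask is on and from example_b where it is off."""
    mask = [rng.random() < 0.5 for _ in example_a]
    return [a if m else b for a, b, m in zip(example_a, example_b, mask)]
```

No labels are involved: the hybrid shares local statistics with real data but not its overall structure, which is the property a negative example needs.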
Unlike backpropagation, the FF algorithm also works if the network contains black-box modules. Since the algorithm does not require differentiable functions, it can still tune its trainable parameters without knowing the inner workings of every layer in the model.
“The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown,” Hinton writes.
Testing the forward-forward algorithm
Hinton tested the FF algorithm on a neural network composed of four fully connected layers, each containing 2,000 neurons with ReLU activations. The network was trained on the MNIST dataset for 60 epochs to reach a 1.36 percent error rate. In comparison, backpropagation takes 20 epochs to achieve similar results. By doubling the learning rate, Hinton was able to reach a 1.40 percent error rate in 40 epochs.
Hinton notes that the forward-forward algorithm “does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.”
According to Hinton, two areas in which the forward-forward algorithm may be superior to backpropagation are as “a model of learning in cortex” and “as a way of making use of very low-power analog hardware without resorting to reinforcement learning.”
“It is nice to see Hinton still trying to draw inspiration from biological neural networks to find a way to make the AI more efficient,” said Daeyeol Lee, neuroscientist and the author of The Birth of Intelligence. “I think that new ideas like FF can stimulate the research on the fundamental computations performed by neurons and synapses.”
Meanwhile, Saty Raghavachary, associate professor of computer science at the University of Southern California, notes that while FF seems like an alternative to backprop, it needs to be further investigated.
“FF is biologically more plausible compared to BP. But in the bigger picture, it is simply an alternative way to learn existing patterns in captured data, and by itself, will not lead to human-like intelligence, given that it deals exclusively in (learning patterns in) human-generated, human-labeled data,” Raghavachary said.
In the paper, Hinton challenges the established computing paradigm and suggests a new direction.
“General purpose digital computers were designed to faithfully follow instructions because it was assumed that the only way to get a general purpose computer to perform a specific task was to write a program that specified exactly what to do in excruciating detail,” Hinton writes. “This is no longer true, but the research community has been slow to comprehend the long-term implications of deep learning for the way computers are built.”
Hinton specifically challenges the idea of separating software and hardware. While this separation has many benefits, including the ability to copy the same software to millions of computers, it also has limits.
“If… we are willing to abandon immortality it should be possible to achieve huge savings in the energy required to perform a computation and in the cost of fabricating the hardware that executes the computation,” Hinton writes. “We can allow large and unknown variations in the connectivity and non-linearities of different instances of hardware that are intended to perform the same task and rely on a learning procedure to discover parameter values that make effective use of the unknown properties of each particular instance of the hardware.”
Basically, this means that the software is gradually built on each hardware instance and is only useful for that instance. Accordingly, the software becomes “mortal,” which means it dies with the hardware. Hinton proposes ways in which a mortal computer can transfer its knowledge to another hardware instance by training it. Despite its disadvantages, mortal computation can have clear advantages, according to Hinton.
“If you want your trillion parameter neural net to only consume a few watts, mortal computation may be the only option,” Hinton writes, though he also asserts that its feasibility will depend on finding a learning procedure that can run efficiently in hardware whose precise details are unknown.
The forward-forward algorithm is a promising candidate, though it remains to be seen how well it scales to large neural networks, Hinton writes.
“I think the (false) analogy between digital computers and the brain has counterproductively promoted the idea that synapses might be analogous to transistors (binary switches), while in reality, synapses are much more complex,” Lee said. “Once people appreciate what the FF models can do, this might lead to some new ideas about the building blocks of cortical computations, including the feedback (top-down) connections in the brain. Given that the brain is full of recurrent and parallel connectivity, FF seems very plausible to implement in the brain.”
Raghavachary, however, believes that biological-like intelligence requires an architecture that can learn about the world by directly, physically, interactively and continuously (DPIC) engaging with the world in order to experience it non-symbolically.
“In sufficiently complex architectures, symbol processing (e.g., language acquisition) could occur, which would be ‘grounded’ via the underlying and complementary non-symbolic experience,” Raghavachary said. “FF seems to be an alternative way to do symbol processing, which is useful, but in my opinion, a full-blown AGI architecture would require the direct experience, non-symbolic, counterpart which only embodiment can provide.”
Peter Robin Hiesinger, professor of neurobiology and author of The Self-Assembling Brain, points out that the key concern is information. In biology, there is a lot of information contained in biological neural networks prior to learning, he said.
“It is in the connectivity. It is in the molecular composition of each individual synaptic connection, which is so much more than just a ‘synaptic weight’. And it all got there—in biology anyways—through genetically encoded development prior to learning,” Hiesinger said.
Genetically encoded development solves two problems raised by Hinton, according to Hiesinger. First, it determines how to get the “trillion parameters” into the network. And second, the evolutionary programming of the genome underlying development is the “learning procedure” that efficiently “runs” in hardware whose precise details are unknown.
“In fact, the phrasing of a ‘procedure that can run efficiently in hardware whose precise details are unknown’ falls back to exactly what Hinton questions: a separation of the procedure (software) from the hardware,” Hiesinger said. “In biology, the precise details of the hardware are the outcome of a long information-encoding process, evolutionarily programmed and unfolded through a growth process.”
The efficiency of biological learning lies in this gradual information-encoding process: There is no point at which the development of the hardware ends and the running of learning procedures begins. Instead, information encoding in the network, including at the subcellular and molecular levels, occurs continuously, first without and later with neuronal activity.
“Learning at the neuronal and synaptic level already occurs as a purely genetically encoded process, prior to the processing of environmental information,” Hiesinger said. “There is no ‘on’ switch, only gradually increasing information encoding that results in a network ‘whose precise details are unknown’ only from the perspective of learning that ignores the information encoded prior to feeding data into the network.”
The precise details of a given state of the biological hardware are so difficult to know because it would take so much information to describe it all. And there is no agreement on what is relevant.
“In the current AI, nothing matters but the synaptic weight. But many aspects of synaptic function cannot be simulated without the properties, dynamics and function in time of myriads of individual molecules that make the synapse react in a way that the up- or downregulation of a synaptic weight through gradient descent does not,” Hiesinger said. “Hinton is in search of that missing information.”