What is the “forward-forward” algorithm, Geoffrey Hinton’s new AI technique?

Image credit: 123RF

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

In the 1980s, Geoffrey Hinton was one of the scientists who invented backpropagation, the algorithm that enables the training of deep neural networks. Backpropagation was key to the success of deep learning and its widespread use today.

But Hinton, who is one of the most celebrated artificial intelligence scientists of our time, believes it is time to think beyond backpropagation and look for other, more efficient ways to train neural networks. And like many other scientists, he draws inspiration from the human brain.

In a new paper presented at NeurIPS 2022, Hinton introduced the “forward-forward algorithm,” a new learning algorithm for artificial neural networks inspired by our knowledge about neural activations in the brain. Though still in early experimentation, forward-forward has the potential to replace backprop in the future, Hinton believes. The paper also proposes a new model for “mortal computation,” which brings us closer to the energy-efficient mechanisms of the brain and can support the forward-forward algorithm.

The problem with backpropagation

Image credit: 123RF (with modifications)

When a deep neural network is in training, it goes through two phases. First is the forward pass, in which the input features go through the layers of the model and are modified by the weights, activation functions, and other transformations to create a prediction. The prediction is then compared to the ground truth and the error is calculated.

In the second phase, backpropagation, the error is propagated in reverse order through the layers, and the weights of the artificial neurons are corrected. Backpropagation uses partial derivatives and the chain rule to compute gradients and adjusts each artificial neuron based on its contribution to the error.
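The two phases can be sketched for a tiny two-layer network in plain NumPy. This is a minimal illustration only; the network shape, learning rate, and function names are illustrative, not Hinton's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(x, y, W1, W2, lr=0.02):
    """One forward pass followed by one backward (backpropagation) pass."""
    # Forward pass: input -> hidden layer (ReLU) -> prediction
    h = np.maximum(0.0, x @ W1)
    y_hat = h @ W2
    err = y_hat - y
    loss = 0.5 * (err ** 2).mean()
    # Backward pass: the error moves in reverse through the layers, and the
    # chain rule gives each weight's contribution to the error.
    n = len(x)
    dW2 = h.T @ err / n              # gradient for the output layer
    dh = err @ W2.T                  # error pushed back to the hidden layer
    dW1 = x.T @ (dh * (h > 0)) / n   # ReLU passes gradient only where h > 0
    W1 -= lr * dW1                   # adjust each neuron in place
    W2 -= lr * dW2
    return loss
```

Repeatedly calling `train_step` on the same batch drives the loss down, which is the whole point of the backward pass: every weight is nudged in proportion to its share of the error.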

While backpropagation is extremely useful, it is very different from what we know about the brain.

“As a model of how cortex learns, backpropagation remains implausible despite considerable effort to invent ways in which it could be implemented by real neurons,” Hinton writes. “There is no convincing evidence that cortex explicitly propagates error derivatives or stores neural activities for use in a subsequent backward pass.”

For example, in the visual system, the connections between different cortical areas do not mirror the bottom-up connections of backpropagation-based deep learning models. Instead, they go in loops, in which neural signals traverse several cortical layers and then return to earlier areas.

One of the main limits of backpropagation is the detachment of learning and inference. To adjust the weights of a neural network, the training algorithm must stop inference to perform backpropagation. In the real world, the brain receives a constant stream of information, and “the perceptual system needs to perform inference and learning in real time without stopping to perform backpropagation,” Hinton writes.

Backpropagation also doesn’t work if the computation done in the forward pass is not differentiable. “If we insert a black box into the forward pass, it is no longer possible to perform backpropagation unless we learn a differentiable model of the black box,” Hinton writes.

Where backpropagation is not possible, reinforcement learning can serve as an alternative algorithm for training neural networks. But reinforcement learning is computationally expensive, slow, and unstable when the neural network contains millions or billions of parameters.

The forward-forward algorithm

In the visual cortex (right), information moves in several directions. In neural networks (left), information moves in one direction.

The idea behind the forward-forward algorithm is to replace the forward and backward passes of backpropagation with two forward passes. The two passes are similar, but they work on different data and have opposite objectives.

The “positive pass” operates on real data and adjusts the weights of the network to increase the “goodness” of each layer. The “negative pass” operates on negative data and adjusts the weights to reduce goodness.

In the paper, Hinton considers two measures of goodness: the sum of squared neural activities and the negative sum of squared activities. Learning aims to adjust the neural network’s weights so that goodness is above a certain threshold for real examples and below the threshold for negative data.

During training, the examples and their labels are merged into a single vector to create positive data. To create negative data, the same process is used, with the difference that a false label is attached to the example. After the data is run through the neural network, the weights are adjusted to increase the goodness (e.g., the sum of squared activations) for the positive examples and decrease it for the negative examples. No backpropagation is required.
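The data-construction step above can be sketched as follows. The one-hot-in-the-first-pixels layout mirrors the label-embedding scheme Hinton describes for MNIST, but the helper names and exact pixel positions here are illustrative assumptions:

```python
import numpy as np

def embed_label(images, labels, n_classes=10):
    """Overwrite the first n_classes inputs with a one-hot label."""
    x = images.copy()
    x[:, :n_classes] = 0.0
    x[np.arange(len(x)), labels] = 1.0
    return x

def make_pairs(images, labels, rng, n_classes=10):
    """Positive data: the correct label. Negative data: a deliberately false one."""
    pos = embed_label(images, labels, n_classes)
    # Shift each label by a random non-zero offset so it is always wrong.
    offsets = rng.integers(1, n_classes, size=len(labels))
    neg = embed_label(images, (labels + offsets) % n_classes, n_classes)
    return pos, neg
```

Positive and negative batches built this way differ only in the label portion of the vector, so the network must learn the association between image content and label rather than any difference in the images themselves.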

This process works well for a neural network with a single hidden layer. For a multi-layer deep learning model, the output of each hidden layer is normalized before being passed on to the next one.
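Putting the pieces together, a single forward-forward layer can be sketched as below. This is a minimal sketch, assuming a logistic loss on goodness minus a threshold, hand-derived local gradients, and illustrative hyperparameters; it is not Hinton's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class FFLayer:
    """One fully connected layer trained with only local, forward updates."""

    def __init__(self, n_in, n_out, lr=0.02, threshold=2.0):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out))
        self.lr = lr
        self.theta = threshold

    def forward(self, x):
        return np.maximum(0.0, x @ self.W)          # ReLU activations

    def goodness(self, x):
        return (self.forward(x) ** 2).sum(axis=1)   # sum of squared activities

    def train_step(self, x_pos, x_neg):
        # Positive pass pushes goodness above the threshold, negative pass
        # pushes it below -- no backward pass through other layers is needed.
        for x, sign in ((x_pos, +1.0), (x_neg, -1.0)):
            h = self.forward(x)
            g = (h ** 2).sum(axis=1)
            z = np.clip(sign * (g - self.theta), -50.0, 50.0)
            p = 1.0 / (1.0 + np.exp(-z))            # logistic "probability of real"
            d_g = sign * (p - 1.0)                  # d loss / d goodness
            d_h = 2.0 * h * d_g[:, None]            # d goodness / d activations
            self.W -= self.lr * (x.T @ d_h) / len(x)  # local gradient step

    def normalized_out(self, x):
        # Normalize activities before the next layer, so it must find new
        # features rather than reuse this layer's goodness.
        h = self.forward(x)
        return h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)
```

Stacking several such layers, each trained on the `normalized_out` of the previous one, gives the multi-layer version described above; each layer learns from its own goodness signal alone.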

The FF algorithm can also benefit from some form of self-supervised learning that can make the neural networks more robust while reducing the need for manually labeled examples. In the paper, Hinton shows how simple masking techniques can create negative examples to train the model.

Unlike backpropagation, the FF algorithm also works if the model contains black-box modules. Since the algorithm does not require differentiable functions, it can still tune its trainable parameters without knowing the inner workings of every layer in the model.

“The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown,” Hinton writes.

Testing the forward-forward algorithm


Hinton tested the FF algorithm on a neural network composed of four fully connected layers, each containing 2,000 neurons with ReLU activations. The network was trained on the MNIST dataset for 60 epochs to reach a 1.36 percent error rate. In comparison, backpropagation takes 20 epochs to achieve similar results. By doubling the learning rate, Hinton was able to reach a 1.40 percent error rate in 40 epochs.

Hinton notes that the forward-forward algorithm “does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.”

According to Hinton, two areas in which the forward-forward algorithm may be superior to backpropagation are as “a model of learning in cortex” and “as a way of making use of very low-power analog hardware without resorting to reinforcement learning.”

“It is nice to see Hinton still trying to draw inspiration from biological neural networks to find a way to make the AI more efficient,” said Daeyeol Lee, neuroscientist and the author of The Birth of Intelligence. “I think that new ideas like FF can stimulate the research on the fundamental computations performed by neurons and synapses.”

Meanwhile, Saty Raghavachary, associate professor of computer science at the University of Southern California, notes that while FF seems like an alternative to backprop, it needs to be further investigated.

“FF is biologically more plausible compared to BP. But in the bigger picture, it is simply an alternative way to learn existing patterns in captured data, and by itself, will not lead to human-like intelligence—given that it deals exclusively in (learning patterns in) human-generated, human-labeled data,” Raghavachary said.

Mortal computers

Image credit: Depositphotos

In the paper, Hinton challenges the established computing paradigm and suggests a new direction.

“General purpose digital computers were designed to faithfully follow instructions because it was assumed that the only way to get a general purpose computer to perform a specific task was to write a program that specified exactly what to do in excruciating detail,” Hinton writes. “This is no longer true, but the research community has been slow to comprehend the long-term implications of deep learning for the way computers are built.”

Hinton specifically challenges the idea of separating software and hardware. While this separation has many benefits, including copying the same software to millions of computers, it also has limits.

“If… we are willing to abandon immortality it should be possible to achieve huge savings in the energy required to perform a computation and in the cost of fabricating the hardware that executes the computation,” Hinton writes. “We can allow large and unknown variations in the connectivity and non-linearities of different instances of hardware that are intended to perform the same task and rely on a learning procedure to discover parameter values that make effective use of the unknown properties of each particular instance of the hardware.”

Basically, this means that the software is gradually built up on each hardware instance and is only useful for that instance. The software thus becomes “mortal”: it dies with the hardware. Hinton offers proposals for how a mortal computer could transfer its knowledge to another hardware instance by training it. Despite its drawbacks, mortal computation can have clear advantages, according to Hinton.

“If you want your trillion parameter neural net to only consume a few watts, mortal computation may be the only option,” Hinton writes, though he also asserts that its feasibility will depend on finding a learning procedure that can run efficiently in hardware whose precise details are unknown.

The forward-forward algorithm is a promising candidate, though it remains to be seen how well it scales to large neural networks, Hinton writes.

“I think the (false) analogy between digital computers and the brain has counterproductively promoted the idea that synapses might be analogous to transistors (binary switches), while in reality, synapses are much more complex,” Lee said. “Once people appreciate what the FF models can do, this might lead to some new ideas about the building blocks of cortical computations, including the feedback (top-down) connections in the brain. Given that the brain is full of recurrent and parallel connectivity, FF seems very plausible to implement in the brain.”

Raghavachary, however, believes that biological-like intelligence requires an architecture that can learn about the world by directly, physically, interactively and continuously (DPIC) engaging with the world in order to experience it non-symbolically.

“In sufficiently complex architectures, symbol processing (e.g., language acquisition) could occur, which would be ‘grounded’ via the underlying and complementary non-symbolic experience,” Raghavachary said. “FF seems to be an alternative way to do symbol processing, which is useful, but in my opinion, a full-blown AGI architecture would require the direct experience, non-symbolic, counterpart which only embodiment can provide.”

Biological learning

Image credit: 123RF

Peter Robin Hiesinger, professor of neurobiology and author of The Self-Assembling Brain, points out that the key concern is information. In biology, there is a lot of information contained in biological neural networks prior to learning, he said.

“It is in the connectivity. It is in the molecular composition of each individual synaptic connection, which is so much more than just a ‘synaptic weight’. And it all got there—in biology anyways—through genetically encoded development prior to learning,” Hiesinger said.

Genetically encoded development solves two problems raised by Hinton, according to Hiesinger. First, it determines how to get the “trillion parameters” into the network. And second, the evolutionary programming of the genome underlying development is the “learning procedure” that efficiently “runs” in hardware whose precise details are unknown. 

“In fact, the phrasing of a ‘procedure that can run efficiently in hardware whose precise details are unknown’ falls back to exactly what Hinton questions: a separation of the procedure (software) from the hardware,” Hiesinger said. “In biology, the precise details of the hardware are the outcome of a long information-encoding process, evolutionarily programmed and unfolded through a growth process.”

The efficiency of biological learning lies in this gradual information-encoding process: there is no clean boundary where hardware development ends and learning procedures begin to run. Instead, information encoding in the network, including at the subcellular and molecular level, occurs continuously, first without and later with neuronal activity.

“Learning at the neuronal and synaptic level already occurs as a purely genetically encoded process, prior to the processing of environmental information,” Hiesinger said. “There is no ‘on’ switch, only gradually increasing information encoding that results in a network ‘whose precise details are unknown’ only from the perspective of learning that ignores the information encoded prior to feeding data into the network.”  

The precise details of a given state of the biological hardware are difficult to know because it would take so much information to describe them all. And there is no agreement on what is relevant.

“In the current AI, nothing matters but the synaptic weight. But many aspects of synaptic function cannot be simulated without the properties, dynamics, and function in time of the myriads of individual molecules that make the synapse react in a way that the up- or downregulation of a synaptic weight through gradient descent does not,” Hiesinger said. “Hinton is in search of that missing information.”


  1. Interesting article. I’m sorry but I don’t see how FF can be an advance toward biologically plausible neural networks. Like conventional DL, FF does not generalize. Curve fitting is not generalization. Generalization is the ability of an intelligent system to perceive any object or pattern without recognizing it. An Amazon Indian, for example, can instantly perceive a bicycle even if he has never seen one before. He can instantly see its 3D shape, size, borders, colors, its various parts, its position relative to other objects, whether it is symmetrical, opaque, transparent or partially occluding, etc. He can perceive all these things because his brain has the ability to generalize. Moreover, his perception of the bicycle is automatically invariant to transformations in his visual field. Edges remain sharp and, if the bicycle is moved to another location or falls on the ground, he remains cognizant of the fact that he is still observing the same object after the transformation.

    By contrast, with either FF or DL, perception is impossible without recognition, i.e., without prior learned representations of the objects to be perceived. Automatic universal invariance is also non-existent. This is a fatal flaw if AGI is the goal.

    Back-propagation is merely a symptom of a much bigger problem in mainstream AI: the notion that learning consists of optimizing an objective function. Function optimization is the opposite of generalization. The brain can perceive anything without optimization. I believe that AGI research should focus exclusively on systematic generalization. That’s where almost all the research money should go in my opinion.

    In conclusion, I’m afraid that Geoffrey Hinton is just spinning his wheels. Same with the rest of the DL community. Fortunately, a few researchers are working on systematic generalization. The best times are still ahead of us.

    • You misrepresented the whole field of ML based on an arbitrary definition of intelligence. What’s more, you seem to believe systematic generalization is somehow a fundamentally different approach to AI. Hint: systematic generalization refers to a learning algorithm’s ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data.

      Optimizing an objective function is not inherently at odds with generalization. It’s just that the optimization step is not the whole picture. An intelligent agent must also develop its own preferences to sample new data from the environment. It is true that backprop is not efficient for real-time applications such as reinforcement learning (and hopefully in the future there will be more specialized optimization algorithms), but it is not fundamentally obstructive. What you can optimize with one algorithm, you can optimize with another, as long as both can navigate the same loss landscapes.

  2. Prof. Geoffrey Hinton did not invent the backpropagation (BP) algorithm, but his work certainly popularized it.

    Feedforward is a simplified architecture standing in for the brain’s actual, more complex architecture, which forms feedback (as opposed to feedforward) connections and loop structures.

    In the real world, the precise details of a system are rarely known. BP won’t work if the activation function is not differentiable. The alternative forward-forward algorithm proposed by Geoffrey Hinton can, it seems at the moment, be applied only to a small subset of problems, as it has generalization and scaling-up issues. Hopefully, these will be resolved soon.

    • Hinton is clueless about the brain, I’m sorry to say. The cortex uses massive feedback pathways for both learning and top-down attention purposes. However, feedback in the brain does not propagate error signals for gradient learning a la DL. It generates success signals using a winner-take-all mechanism. The signals are used not to modify weights, but to strengthen synaptic connections until they become permanent. It’s called STDP, a form of Hebbian learning.

      Furthermore, the brain does not optimize objective functions, a learning approach that is inherently and hopelessly non-generalizing. The brain discovers context, which is temporal at its core. Spike timing is essential to context. Thus generalized intelligence is context-bound. The work of Hinton and the rest of the DL community, while valuable to automation applications, is irrelevant to AGI. One man’s opinion, of course.

      • Any form of discovery exercise can be phrased as an optimization problem (and therefore given an objective function). So optimizing objective functions is not inherently non-generalizing.

  3. Fitting NNs to real problem datasets is far more cumbersome and trouble-prone, with a lot of hyperparameter guessing. NNs can learn nonlinear, multi-parameter problems, yes, but on paper; on real-world problems, with what accuracy, and at the cost of how much competent manpower and time? As long as the NN structure is chosen by guesswork, there is more luck than science involved. We are all doing data analysis, and everything depends on the particular data, not only on the NNs; their developers are wrong, or they are working from wrong assumptions.
    NNs are only tools to help with data processing and analysis, and they are mostly hard to use.

  4. So this seems to enjoin issues of building neurosymbolic computing using FF, for hardware evolutionarily structured to causal input as pervasive. I think ENNs (epistemic neural networks) will find, once FF data is trained, that Markov time series are pervasive in data training on accuracy models, i.e., each year from 1960-1970 being trained one year at a time, predicting each forward year’s events, so that accuracy at year 1969 best predicts 1970’s events in the news. This suggests data is temporal-spatial, and that to adjust learning that is biological, in terms of a blood-brain interface using IBM’s corelet software to mediate biological neuronal development and artificial neural domain mapping from a to b, we still lack a data time domain for evolutionary learning. That speaks to mortal computing, where hardware and software are mutually evolutionary, for example in a model of neurosynaptic cores used for vision processing that can equate object weights to embed object dates. Ultimately, shifting Boolean logic to compute models and energy modalities of hardware specific to a language for a biology-emulating process requires a neural compute revolution, one that ensures dualistic biological tree-parity planning of hardware to software. Thus the substrate medium of computing, quantum or neurosynaptic, will need better parameters for the time/space domain, which has yet to fully explore backpropagation through structure and backpropagation through time, where one is spatial and one is time-ordered. To fully understand the analogous model of how humans sense-encode memories in neural biological evolution, we should assert that FF needs some domain encoded in training data to unify the origin of data and the date of its creation. While a bit off topic, the recent stable diffusion AI technology may yield better results when a name like Jesus is used to represent a character.

    Jesus is widely represented in training data, yet no time-ordered sequence of social data in training can trace his origin through 2,000-plus years of references; thus his name, unless removed in pre-training, takes on a vague frequency/time rate, suggesting that underlying religious concepts in society have epistemic roots, that weighting values become deep-belief driven, and that an author may be religious and believe in Jesus while his novel’s character is never explicitly given a deep religious context. Weighting in this context of data adjustment using a BP method may argue that time flows backward in terms of memory, which is the embedding of learned events, since knowledge in learning has to come with referential domains of rating good/bad as judgement, and awareness equates to consciousness. So how can negative/positive weighting in FF derive an instance trained under a mortal-computing doctrine, enjoining mutually constructed hardware/software as a duality of a learner, i.e., an FF algorithm? I think data scientists need to look at time-order accuracy models in training; while an absurd notion, research in aged light might create a metric for debating how data can have a date of origin. Absurd, yes, but conceptually, to move toward an analog AI space where biological substrates of grey matter or cellular media are used to model unique life from synthetic biology (which is where FF would, in my opinion, best be developed), we need to see that data evolution has a time/space domain intrinsic to the laws of conservation of energy. What does this mean? Simply that training which sees data created after a date, for example yesterday, cannot weight it as data from 1845, not in its text but in an embedded value for its date of origin. It suggests that BP training erroneously sorts data out of the context of its space/time origin, i.e., a duality of structure/time, such that a middle step toward FF may be a hardware/software problem for neurosymbolic computing: how to blend BPTT/BPTS (backpropagation through time / backpropagation through structure) using a blood-brain interface language and syntax for weights. That would see the mobility and mobilization of synthetic life demanding sense encoding of the vestibular senses, a common-coding issue with training robotics to have a synthetic inferior parietal cortex for goal reasoning and location data. Most importantly, when we encode memories in human biology, we still don’t know, at a scale of 10**16, whether the location data of neurons is mapped at a genetic recall level to physical content-addressable memory, a mind map suggesting a get/set feature of neurons: while activation occurs to call memories, so that knowledge can plan a forward-forward decision, there may be a genetic universal neural weight/location map stating that a memory recall goes to a location plus weight. I think Hinton is right about knowledge in FF, but are we excluding the BP pass that adjusts, in favor of a sort of biologically contiguous learner who relies on some intrinsic epigenetic model? If so, are we looking at new substrate media that encode experiential data? That a datum is born when it is made, and can evolve only if it has a qualia of some grounded base paradigm, i.e., it can be good or bad? If so, this is a metaphysical debate on the social perception of good/bad, a moral story of AI seeking morality. Then we have to have bad? What a bummer….
