The real “Bitter Lesson” of artificial intelligence

In a popular blog post titled “The Bitter Lesson,” Richard Sutton argues that AI’s progress has resulted from cheaper computation, not human design decisions based on problem-specific information. Sutton disparages researchers who build knowledge into solutions based on their understanding of a problem to improve performance. This temptation, Sutton explains, is good for short-term performance gains, and such vanity is satisfying to the researcher. However, such human ingenuity comes at the expense of AI’s divine destiny by inhibiting the development of a solution that doesn’t want our help understanding a problem. AI’s goal is to recreate the problem-solver ex nihilo, not to solve problems directly.[1]

In a rebuttal titled “A Better Lesson,” roboticist Rodney Brooks argues that Sutton’s examples of general solutions that ignore human ingenuity in favor of “brute force” computation are in fact the result of human ingenuity. Brooks provides examples of human ingenuity, such as network architectures like convolutional neural networks and computational features like translational invariance, effectively arguing that the field swaps failed human ingenuity for more successful human ingenuity. Brooks’ eco-friendly argument suggests that having an artificial intelligence learn everything “seems pedantic to the extreme” and adds that learning everything “will drive up the computational costs… by many orders of magnitude.”

In one such example, Sutton writes, “the methods that defeated the world champion, [Garry] Kasparov, in 1997, were based on massive, deep search.” It may be true that deeper search results in higher performance since more chess moves would be evaluated at each step in a game. However, there are diminishing returns, and not everyone is interested in exchanging future theoretical performance gains for higher electric bills. Still, the claim that Deep Blue had no human knowledge built into the system is false. The “massive, deep search” was explicitly designed for games and primed with opening moves provided by grandmasters, and its evaluation function was designed by humans for chess rather than discovered by the search algorithm.
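To make the point concrete, here is a minimal sketch of how human knowledge sits inside a game search. It is not Deep Blue’s code and it is not chess: it assumes a toy subtraction game (take one to three stones from a pile; whoever takes the last stone wins), and every name in it (`OPENING_BOOK`, `evaluate`, `negamax`, `choose_move`) is hypothetical. The search itself is generic, but the opening book and the evaluation function are written by a person who understands the game.

```python
from typing import Dict, List

# Human knowledge #1 (hypothetical opening book): pile size -> move to play.
OPENING_BOOK: Dict[int, int] = {21: 1}


def legal_moves(stones: int) -> List[int]:
    """A player may take 1, 2, or 3 stones, but never more than remain."""
    return [take for take in (1, 2, 3) if take <= stones]


def evaluate(stones: int) -> int:
    """Human knowledge #2: a hand-written heuristic for the player to move.

    A person who has studied this game knows that facing a multiple of
    four is bad for the player to move. The search does not learn this.
    """
    return -1 if stones % 4 == 0 else 1


def negamax(stones: int, depth: int) -> int:
    """Depth-limited search; scores are from the mover's point of view."""
    if stones == 0:
        return -1  # the opponent took the last stone, so the mover has lost
    if depth == 0:
        return evaluate(stones)  # out of depth: fall back on the heuristic
    return max(-negamax(stones - take, depth - 1) for take in legal_moves(stones))


def choose_move(stones: int, depth: int = 2) -> int:
    """Consult the opening book first, then search."""
    if stones in OPENING_BOOK:
        return OPENING_BOOK[stones]
    return max(
        legal_moves(stones),
        key=lambda take: -negamax(stones - take, depth - 1),
    )


if __name__ == "__main__":
    print(choose_move(21))  # answered from the opening book
    print(choose_move(10))  # answered by search plus the heuristic
```

Raising `depth` buys more search, but the quality of the answer at the search horizon still depends on the hand-written `evaluate`.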

Rather than advocating for deeper search, a more interesting question for researchers is how humans can be so competitive in a game where we don’t use search, let alone deep search, to evaluate millions of moves per second. Human chess players may analyze ten moves by anticipating advantages and working backward to find sequences of moves. Unfortunately, the question of how humans can be so competitive is ignored because so few in the field care. Sutton makes this point and urges researchers to stop “building in how we think we think,” which he says “does not work in the long run.” This declaration is strange since Sutton is one of the founders of modern computational reinforcement learning, which developed from the now-defunct theory of the mind called behaviorism. Behaviorism reduces mental states to physical states and purges the mind in favor of environmental stimuli and behavioral responses. In other words, reinforcement learning was a response to how we thought we thought. I suppose Sutton means it is okay to build in how we think we think if those theories are defunct.[2]

Like Ray Kurzweil, Sutton gives credit for all future advancements in AI to Moore’s law. However, the Moore’s-law-hence-AI argument is a category mistake, not to mention anti-intellectual. It is a category mistake because advances in artificial intelligence require more than advances in computation.

This thinking ignores the philosophical, ontological, and technological breakthroughs needed to achieve true artificial intelligence, which will not be achieved through brute-force search or learning. It is anti-intellectual because it purges all research in favor of Moore’s law. Such a statement would point to Yann LeCun, Geoffrey Hinton, and countless other researchers and declare that their work wasn’t groundbreaking. Instead, they were saved by the grace of Moore’s law, which suggests that the next generation should stand still and do nothing.

Incorporating problem-specific information into a solution comes at the expense of scale if scale is taken to mean all other problems.[3] However, most businesses only need to solve some problems, mainly the problems their customers face. There is no such thing as a general-purpose company, so there is little need for a general-purpose solution. This is the bitter business lesson because problem-specific information helps you get your solution out the door faster. Additionally, customers care about stable value propositions. Machine learning is probabilistic and has unstable value propositions, and it is still comically brittle despite benefiting from additional computational power. Therefore, a more stable system using non-probabilistic elements operating in the background or foreground provides a more durable solution for customers and more stable value propositions. Also, machine learning-only solutions cannot be fixed in a precise way. You can improve the training data, collect more data, tune parameters, or substitute one solution for another to improve performance. However, if you need something specifically solved and don’t want to replace the entire system, you cannot guarantee improvement unless you use non-probabilistic elements. The catch is that non-probabilistic elements do not scale past what can be held in the heads of a few programmers and should be used cautiously.[4] The point is that your solution doesn’t need to know or want to solve a problem, but you and your team do.
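As a rough illustration of what a non-probabilistic element operating in the foreground might look like, consider the sketch below. Everything in it is an assumption made for the example: `toy_model` is a keyword counter standing in for a trained classifier, and `HybridClassifier` is a hypothetical wrapper, not an API from any library. The idea is that one specific failure gets a precise, guaranteed fix through a deterministic override, while every other input still goes to the model.

```python
from typing import Callable, Dict, Optional, Tuple

Model = Callable[[str], Tuple[str, float]]


def toy_model(text: str) -> Tuple[str, float]:
    """Stand-in for a trained classifier: returns (label, confidence)."""
    score = sum(word in text.lower() for word in ("refund", "broken", "late"))
    if score:
        return ("complaint", 0.6 + 0.1 * score)
    return ("other", 0.55)


class HybridClassifier:
    """Deterministic overrides are checked first; the model handles the rest."""

    def __init__(self, model: Model, overrides: Optional[Dict[str, str]] = None):
        self.model = model
        self.overrides: Dict[str, str] = {}
        for text, label in (overrides or {}).items():
            self.add_override(text, label)

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so the rule matches reliably.
        return " ".join(text.lower().split())

    def add_override(self, text: str, label: str) -> None:
        """Pin one specific input to one specific answer, with no retraining."""
        self.overrides[self._key(text)] = label

    def predict(self, text: str) -> str:
        key = self._key(text)
        if key in self.overrides:
            return self.overrides[key]  # non-probabilistic path
        label, _confidence = self.model(text)  # probabilistic path
        return label


if __name__ == "__main__":
    clf = HybridClassifier(toy_model)
    message = "Better late than never, thanks!"
    print(clf.predict(message))  # the stand-in model mislabels this
    clf.add_override(message, "praise")
    print(clf.predict(message))  # the override fixes exactly this case
```

The override table is also where the catch shows up: each entry is something a programmer has to remember and maintain, so it only works while the list stays small.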

Sutton’s article is short (a relatively concise 1,116 words), making me wonder why. Why not make a clever self-referential statement to the reader like the one he advocates for in the article by waiting to write an article that would generalize to all possible questions? Instead, Sutton merely shares his unique perspective on a specific topic based on his knowledge that took a career to accumulate. Instructional frivolity aside, I hope we can embrace the truth that science requires scientists and not transfer all potential to Moore’s law.[5] It is a mistake to do nothing until we can do everything.[6]

[1] Not to suggest that Sutton is no trailblazer, but this argument is not new. AI pioneer Marvin Minsky shared a similar notion when he recalled his approach to AI in a profile in The New Yorker in 1981, saying, “I mustn’t tell the machine exactly what to do.”

[2] AI research is almost exclusively interested in maintaining the computational metaphors of the mind, false though they are.

[3] Actually, a lot more: “The Bitter Lesson” advocates for less, maybe no, consideration of safety, security, privacy, fairness, and transparency because it has one singular goal in mind. This kind of advocacy has destructive elements.

[4] For more information on hybrid systems:

[5] An especially salient example of hubris is Meta’s newest language model, “Galactica,” which survived only three days:

[6] I thought I was borrowing from Edmund Burke, but it appears that Sydney Smith deserves credit for my inspiration: “Nobody made a greater mistake than he who did nothing because he could do only a little.”
