Reviews

Understanding the AI alignment problem

January 18, 2021

Welcome to AI book reviews, a series of posts that explore the latest literature on artificial intelligence.

For decades, we’ve been trying to develop artificial intelligence in our own image. And at every step of the way, we’ve managed to create machines that can perform marvelous feats and at the same time make surprisingly dumb mistakes.

After six decades of research and development, aligning AI systems with our goals, intents, and values continues to remain an elusive objective. Every major field of AI seems to solve part of the problem of replicating human intelligence while leaving out holes in critical areas. And these holes become problematic when we apply current AI technology to areas where we expect intelligent agents to act with the rationality and logic we expect from humans.

In his latest book, The Alignment Problem: Machine Learning and Human Values, programmer and researcher Brian Christian discusses the challenges of making sure our AI models capture “our norms and values, understand what we mean or intend, and, above all, do what we want.” This is an issue that has become increasingly urgent in recent years, as machine learning has found its way into many fields and applications where making wrong decisions can have disastrous consequences.

As Christian describes: “As machine-learning systems grow not just increasingly pervasive but increasingly powerful, we will find ourselves more and more often in the position of the ‘sorcerer’s apprentice’: we conjure a force, autonomous but totally compliant, give it a set of instructions, then scramble like mad to stop it once we realize our instructions are imprecise or incomplete—lest we get, in some clever, horrible way, precisely what we asked for.”

In The Alignment Problem, Christian provides a thorough depiction of the current state of artificial intelligence and how we got here. He also discusses what’s missing in different approaches to creating AI.

Here are some key takeaways from the book.

Machine learning: mapping inputs to outputs

In the earlier decades of AI research, symbolic systems made remarkable inroads in solving complicated problems that required logical reasoning. Yet they were terrible at simple tasks that every human learns at a young age, such as detecting objects, people, voices, and sounds. They also didn’t scale well and required a lot of manual effort to create the rules and knowledge that defined their behavior.

More recently, growing interest in machine learning and deep learning have helped advance computer vision, speech recognition, and natural language processing, the very fields that symbolic AI struggled at. Machine learning algorithms scale well with the availability of data and compute resources, which is largely why they’ve become so popular in the past decade.

But despite their remarkable achievements, machine learning algorithms are at their core complex mathematical functions that map observations to outcomes. Therefore, they’re as good as their data and they start to break as the data they face in the world starts to deviate from examples they’ve seen during training.

In The Alignment Problem, Christian goes through many examples where machine learning algorithms have caused embarrassing and damaging failures. A popular example is a Google Photos classification algorithm that tagged dark-skinned people as gorillas. The problem was not with the AI algorithm but with the training data. Had Google trained the model on more examples of people with dark skin, it could have avoided the disaster.

“The problem, of course, with a system that can, in theory, learn just about anything from a set of examples is that it finds itself, then, at the mercy of the examples from which it’s taught,” Christian writes.

What’s worse is that machine learning models can’t tell right from wrong and make moral decisions. Whatever problem exists in a machine learning model’s training data will be reflected in the model’s behavior, often in nuanced and inconspicuous ways. For instance, in 2018, Amazon shut down a machine learning tool used in making hiring decisions because its decisions were biased against women. Obviously, none of the AI’s creators wanted the model to select candidates based on their gender. In this case, the model, which was trained on the company’s historical hiring data, reflected problems within Amazon itself.

This is just one of the several cases where a machine learning model has picked up biases that existed in its training data and amplified them in its own unique ways. It is also a warning against trusting machine learning models that are trained on data we blindly collect from our own past behavior.

“Modeling the world as it is is one thing. But as soon as you begin using that model, you are changing the world, in ways large and small. There is a broad assumption underlying many machine-learning models that the model itself will not change the reality it’s modeling. In almost all cases, this is false,” Christian writes. “Indeed, uncareful deployment of these models might produce a feedback loop from which recovery becomes ever more difficult or requires ever greater interventions.”

Human intelligence has a lot to do with gathering data, finding patterns, and turning those patterns into actions. But while we usually try to simplify intelligent decision-making into a small set of inputs and outputs, the challenges of machine learning show that our assumptions about data and machine learning often turn out to be false.

“We need to consider critically… not only where we get our training data but where we get the labels that will function in the system as a stand-in for ground truth. Often the ground truth is not the ground truth,” Christian warns.

Reinforcement learning: maximizing rewards

OpenAI dota 2 reinforcement learning — Reinforcement learning has helped researchers create AI that achieves remarkable feats such as beating champions at complicated video games.

Another branch of AI that has gained much traction in the past decade is reinforcement learning, a subset of machine learning in which the model is given the rules of a problem space and a reward function. The model is then left to explore the space for itself and find ways to maximize its rewards.

“Reinforcement learning… offers us a powerful, and perhaps even universal, definition of what intelligence is,” Christian writes. “If intelligence is, as computer scientist John McCarthy famously said, ‘the computational part of the ability to achieve goals in the world,’ then reinforcement learning offers a strikingly general toolbox for doing so. Indeed it is likely that its core principles were stumbled into by evolution time and again—and it is likely that they will form the bedrock of whatever artificial intelligences the twenty-first century has in store.”

Reinforcement learning is behind great scientific achievements such as AI systems that have mastered Atari games, Go, StarCraft 2, and DOTA 2. It has also found many uses in robotics. But each of those achievements also proves that purely pursuing external rewards is not exactly how intelligence works.

For one thing, reinforcement learning models require massive amounts of training cycles to obtain simple results. For this very reason, research in this field has been limited to a few labs that are backed by very wealthy companies. Reinforcement learning systems are also very rigid. For instance, a reinforcement learning model that plays StarCraft 2 at championship level won’t be able to play another game with similar mechanics. Reinforcement learning agents also tend to get stuck in meaningless loops that maximize a simple reward at the expense of long-term goals. An example is this boat-racing AI that has managed to hack its environment by continuously collecting bonus items without considering the greater goal of winning the race.

“Unplugging the hardwired external rewards may be a necessary part of building truly general AI: because life, unlike an Atari game, emphatically does not come pre-labeled with real-time feedback on how good or bad each of our actions is,” Christian writes. “We have parents and teachers, sure, who can correct our spelling and pronunciation and, occasionally, our behavior. But this hardly covers a fraction of what we do and say and think, and the authorities in our life do not always agree. Moreover, it is one of the central rites of passage of the human condition that we must learn to make these judgments by our own lights and for ourselves.”

Christian also suggests that while reinforcement learning starts with rewards and develops behavior that maximizes those rewards, the reverse is perhaps even more interesting and critical: “Given the behavior we want from our machines, how do we structure the environment’s rewards to bring that behavior about? How do we get what we want when it is we who sit in the back of the audience, in the critic’s chair—we who administer the food pellets, or their digital equivalent?”

Should AI imitate humans

machine learning artificial intelligence

In The Alignment Problem, Christian also discusses the implications of developing AI agents that learn through pure imitation of human actions. An example is self-driving cars that learn by observing how humans drive.

Imitation can do wonders, especially in problems where the rules and labels are not clear-cut. But again, imitation paints an incomplete picture of the intelligence puzzle. We humans learn a lot through imitation and rote learning, especially at a young age. But imitation is but one of several mechanisms we use to develop intelligent behavior. As we observe the behavior of others, we also adapt our own version of that behavior that is aligned with our own limits, intents, goals, needs, and values.

“If someone is fundamentally faster or stronger or differently sized than you, or quicker-thinking than you could ever be, mimicking their actions to perfection may still not work,” Christian writes. “Indeed, it may be catastrophic. You’ll do what you would do if you were them. But you’re not them. And what you do is not what they would do if they were you.”

In other cases, AI systems use imitation to observe and predict our behavior and try to assist us. But this too presents a challenge. AI systems are not bound by the same constraints and limits as we are, and they often misinterpret our intentions and what’s good for us. Instead of protecting us against our bad habits, they amplify them and they push us toward acquiring the bad habits of others. And they’re becoming pervasive in every aspect of our lives.

“Our digital butlers are watching closely,” Christian writes. “They see our private as well as our public lives, our best and worst selves, without necessarily knowing which is which or making a distinction at all. They by and large reside in a kind of uncanny valley of sophistication: able to infer sophisticated models of our desires from our behavior, but unable to be taught, and disinclined to cooperate. They’re thinking hard about what we are going to do next, about how they might make their next commission, but they don’t seem to understand what we want, much less who we hope to become.”

What comes next?

Advances in machine learning show how far we’ve come toward the goal of creating thinking machines. But the challenges of machine learning and the alignment problem also remind us of how much more we have to learn before we can create human-level intelligence.

AI scientists and researchers are exploring several different ways to overcome these hurdles and create AI systems that can benefit humanity without causing harm. Until then, we’ll have to tread carefully and beware of how much credit we assign to systems that mimic human intelligence on the surface.

“One of the most dangerous things one can do in machine learning—and otherwise—is to find a model that is reasonably good, declare victory, and henceforth begin to confuse the map with the territory,” Christian warns.

Subscribe to TechTalks

Why LLMs should stop thinking out loud (and what comes after…

Beyond vibe coding: How Codev 3.0 engineers the AI-powered dev team

How Cursor’s Composer 2.5 uses self-distillation to beat the frontier LLMs…

Vertical integration as AI infrastructure: What 21D’s full arch implant system…

Why sandboxing OpenClaw doesn’t stop data exfiltration

Applied ML: When ‘perfect’ becomes the enemy of ‘good’

AI can’t replace software engineers yet, but here is how to…

How to turbocharge your product and market research with DeepSearch

How looking differently at data can save your machine learning project

Building a solid data foundation for generative AI applications

Demystifying loop engineering: Get more from AI agents, avoid loopmaxxing

Why the future of agentic AI is all about the harness

The evolution of LLM tool-use from API calls to agentic applications

What makes DeepSeek-V3.2 so efficient?

What to know about Claude Opus 4.5

AI is writing your code, but who’s reviewing it?

Machine learning in space: Building intelligent systems for the harshest environments

Decoding the brain, inspiring AI: How Rahul Biswas is bridging neuroscience…

The cash flow conundrum: How technology is reshaping small business finance

What to know about the security of open-source machine learning models

Understanding the AI alignment problem

Machine learning: mapping inputs to outputs

Reinforcement learning: maximizing rewards

Should AI imitate humans

What comes next?

Like this:

Leave a ReplyCancel reply

Machine learning: mapping inputs to outputs

Reinforcement learning: maximizing rewards

Should AI imitate humans

What comes next?

Like this:

Leave a ReplyCancel reply

Discover more from TechTalks