Blog

How reinforcement learning chooses the ads you see

February 22, 2021

This article is part of “Deconstructing artificial intelligence,” a series of posts that explore the details of how AI applications work.

Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. And they all want to answer the same question: Which of the many ads they have in their catalog is more likely to appeal to a certain viewer? Finding the right answer to this question can have a huge impact on revenue when you are dealing with hundreds of websites, thousands of ads, and millions of visitors.

Fortunately (for the ad agencies, at least), reinforcement learning, the branch of artificial intelligence that has become renowned for mastering board and video games, provides a solution. Reinforcement learning models seek to maximize rewards. In the case of online ads, the RL model will try to find the ad that users are more likely to click on.

The digital ad industry generates hundreds of billions of dollars every year and provides an interesting case study of the powers of reinforcement learning.

Naïve A/B/n testing

To better understand how reinforcement learning optimizes ads, consider a very simple scenario: You’re the owner of a news website. To pay for the costs of hosting and staff, you have entered a contract with a company to run their ads on your website. The company has provided you with five different ads and will pay you one dollar every time a visitor clicks on one of the ads.

Your first goal is to find the ad that generates the most clicks. In advertising lingo, you will want to maximize your click-through rate (CTR). The CTR is ratio of clicks over number of ads displayed, also called impressions. For instance, if 1,000 ad impressions earn you three clicks, your CTR will be 3 / 1000 = 0.003 or 0.3 percent.

Before we solve the problem with reinforcement learning, let’s discuss A/B testing, the standard technique for comparing the performance of two competing solutions (A and B) such as different webpage layouts, product recommendations, or ads. When you’re dealing with more than two alternatives, it is called A/B/n testing.

In A/B/n testing, the experiment’s subjects are randomly divided into separate groups and each is provided with one of the available solutions. In our case, this means that we will randomly show one of the five ads to each new visitor of our website and evaluate the results.

Say we run our A/B/n test for 100,000 iterations, roughly 20,000 impressions per ad. Here are the clicks-over-impression ratio of our ads:

Ad 1: 80/20,000 = 0.40% CTR

Ad 2: 70/20,000 = 0.35% CTR

Ad 3: 90/20,000 = 0.45% CTR

Ad 4: 62/20,000 = 0.31% CTR

Ad 5: 50/20,000 = 0.25% CTR

Our 100,000 ad impressions generated $352 in revenue with an average CTR of 0.35%. More importantly, we found out that ad number 3 performs better than the others, and we will continue to use that one for the rest of our viewers. With the worst performing ad (ad number 2), our revenue would have been $250. With the best performing ad (ad number 3), our revenue would have been $450. So, our A/B/n test provided us with the average of the minimum and maximum revenue and yielded the very valuable knowledge of the CTR rates we sought.

Digital ads have very low conversion rates. In our example, there’s a subtle 0.2-percent difference between our best- and worst-performing ads. But this difference can have a significant impact at scale. At 1,000 impressions, ad number 3 will generate an extra $2 in comparison to ad number 5. At a million impressions, this difference will become $2,000. When you’re running billions of ads, a subtle 0.2 percent can have a huge impact on revenue.

Therefore, finding these subtle differences is very important in ad optimization. The problem with A/B/n testing is that it is not very efficient at finding these differences. It treats all ads equally and you need to run each ads tens of thousands of times until you discover their differences at a reliable confidence level. This can result in lost revenue, especially when you have a larger catalog of ads.

Another problem with classic A/B/n testing is that it is static. Once you find the optimal ad, you will have stick to it. If the environment changes due to a new factor (seasonality, news trends, etc.) and causes one of the other ads to have a potentially higher CTR, you won’t find out unless you run the A/B/n test all over again.

What if we could change A/B/n testing to make it more efficient and dynamic?

This is where reinforcement learning comes into play. A reinforcement learning agent starts by knowing nothing about its environment actions, rewards, and penalties. The agent must find a way to maximize its rewards.

In our case, the RL agent’s actions are one of five ads to display. The RL agent will receive a reward point every time a user clicks on an ad. It must find a way to maximize ad clicks.

The multi-armed bandit

In some reinforcement learning environments, actions are evaluated in sequences. For instance, in video games, you must perform a series of actions to reach the reward, which is finishing a level or winning a match. But when serving ads, the outcome of every ad impression is evaluated independently; it is a single-step environment.

To solve the ad optimization problem, we’ll use a “multi-armed bandit” (MAB), a reinforcement learning algorithm that is suited for single-step reinforcement learning. The name of the multi-armed bandit comes from an imaginary scenario in which a gambler is standing at a row of slot machines. The gambler knows that the machines have different win rates, but he doesn’t know which one provides the highest reward.

If he sticks to one machine, he might lose the chance of selecting the machine with the highest win rate. Therefore, the gambler must find an efficient way to discover the machine with the highest reward without using up too much of his tokens.

Ad optimization is a typical example of a multi-armed bandit problem. In this case, the reinforcement learning agent must find a way to discover the ad with the highest CTR without wasting too much valuable ad impressions on inefficient ads.

Exploration vs exploitation

One of the problems every reinforcement learning model faces is the “exploration vs exploitation” challenge. Exploitation means sticking to the best solution the RL agent has so far found. Exploration means trying other solutions in hopes of landing on one that is better than the current optimal solution.

In the context of ad selection, the reinforcement learning agent must decide between choosing the best-performing ad and exploring other options.

One solution to the exploitation-exploration problem is the “epsilon-greedy” (ε-greedy) algorithm. In this case, the reinforcement learning model will choose the best solution most of the time, and in a specified percent of cases (the epsilon factor) it will choose one of the ads at random.

exploration vs exploitation — Every reinforcement learning algorithm must find the right balance between exploiting optimal solutions and exploring new options

Here’s how it works in practice. Say we have an epsilon-greedy MAB agent with the ε factor set to 0.2. This means that the agent chooses the best-performing ad 80 percent of the time and explores other options 20 percent of the time.

The reinforcement learning model starts without knowing which of the ads performs better, therefore it assigns each of them an equal value. When all ads are equal, it will choose one of them at random each time it wants to serve an ad.

After serving 200 ads (40 impressions per ad), a user clicks on ad number 4. The agent adjusts the CTR of the ads as follows:

Ad 1: 0/40 = 0.0%

Ad 2: 0/40 = 0.0%

Ad 3: 0/40 = 0.0%

Ad 4: 1/40 = 2.5%

Ad 5: 0/40 = 0.0%

Now, the agent thinks that ad number 4 is the top performing ad. For every new ad impression, it will pick a random number between 0 and 1. If the number is above 0.2 (the ε factor), it will choose ad number 4. If it’s below 0.2, it will choose one of the other ads at random.

Now, our agent runs 200 other ad impressions before another user clicks on an ad, this time on ad number 3. Note that of these 200 impressions, 160 belong to ad number 4, because it was the optimal ad. The rest are equally divided between the other ads. Our new CTR values are as follows:

Ad 1: 0/50 = 0.0%

Ad 2: 0/50 = 0.0%

Ad 3: 1/50 = 2.0%

Ad 4: 1/200 = 0.5%

Ad 5: 0/50 = 0.0%

Now the optimal ad becomes ad number 3. It will get 80 percent of the ad impressions. Let’s say after another 100 impressions (80 for ad number three, four for each of the other ads), someone clicks on ad number 2. Here’s how what the new CTR distribution looks like:

Ad 1: 0/54 = 0.0%

Ad 2: 1/54 = 1.8%

Ad 3: 1/130 = 0.7%

Ad 4: 1/204 = 0.49%

Ad 5: 0/54 = 0.0%

Now, ad number 2 is the optimal solution. As we serve more ads, the CTRs will reflect the real value of each ad. The best ad will get the lion’s share of the impressions, but the agent will continue to explore other options. Therefore, if the environment changes and users start to show more positive reactions to a certain ad, the RL agent can discover it.

After running 100,000 ads, our distribution can look something like the following:

Ad 1: 123/30,600 = 0.40% CTR

Ad 2: 67/18,900 = 0.35% CTR

Ad 3: 187/41,400 = 0.45% CTR

Ad 4: 35/11,300 = 0.31% CTR

Ad 5: 15/5,800 = 0.26% CTR

With the ε-greedy algorithm, we were able to increase our revenue from $352 to $426 on 100,000 ad impression and an average CTR of 0.42 percent. This is a great improvement over the classic A/B/n testing model.

Improving the ε-greedy algorithm

The key to the ε-greedy reinforcement learning algorithm is adjusting the epsilon factor. If you set it too low, it will exploit the ad which it thinks is optimal at the expense of not finding a possibly better solution. For instance, in the example we explored above, ad number four happens to generate the first click, but in the long run, it doesn’t have the highest CTR. Small sample sizes do not necessarily represent true distributions.

On the other hand, if you set the epsilon factor too high, your RL agent will waste too many resources exploring non-optimal solutions.

One way you can improve the epsilon-greedy algorithm is defining a dynamic policy. When the MAB model is fresh, you can start with a high epsilon value to do more exploration and less exploitation. As your model serves more ads and gets a better estimate of the value of each solution, it can gradually reduce the epsilon value until it reaches a threshold value.

In the context of our ad-optimization problem, we can start with an epsilon value of 0.5 and reduce it by 0.01 after every 1,000 ad impression until it reaches 0.1.

Another way to improve our multi-armed bandit is to put more weight on new observations and gradually reduces the value of older observations. This is especially useful in dynamic environments such as digital ads and product recommendations, where the value of solutions can change over time.

Here’s a very simple way you can do this. The classic way to update the CTR after serving an ad is as follows:

(result + past_results) / impressions

Here, result is the outcome of the ad displayed (1 if clicked, 0 if not clicked), past_results is the cumulative number of clicks the ad has garnered so far, and impressions is the total number of times the ad has been served.

To gradually fade old results, we add a new alpha factor (between 0 and 1), and make the following change:

(result + past_results * alpha) / impressions

This small change will give more weight to new observations. Therefore, if you have two competing ads that have an equal number of clicks and impressions, the one whose clicks are more recent will be favored by your reinforcement learning model. Also, if an ad had a very high CTR rate in the past but has become unresponsive in recent times, its value will decline faster in this model, forcing the RL model to move to other alternatives earlier and waste less resources on the inefficient ad.

Adding context to the reinforcement learning model

contextual bandit — Contextual bandits use function approximation to factor in the individual characteristics of ad viewers

In the age of internet, websites, social media, and mobile apps have plenty of information on every single user such as their geographic location, device type, and the exact time of day they’re viewing the ad. Social media companies have even more information about their users, including age and gender, friends and family, the type of content they have shared in the past, the type of posts they liked or clicked on in the past, and more.

This rich information gives these companies the opportunity to personalize ads for each viewer. But the multi-armed bandit model we created in the previous section shows the same ad to everyone and doesn’t take the specific characteristic of each viewer into account. What if we wanted to add context to our multi-armed bandit?

One solution is to create several multi-armed bandits, each for a specific sub-field of users. For instance, we can create separate RL models for users in North America, Europe, Middle East, Asia, Africa, and so on. What if we wanted to also factor in gender? Then we would have one reinforcement learning model for female users in North America, one for male users in North America, one for female users in Europe, male users in Europe, etc. Now, add age ranges and device types, and you can see that it will quickly develop into a big problem, creating an explosion of multi-armed bandits that become hard to train and maintain.

An alternative solution is to use a “contextual bandit,” an upgraded version of the multi-armed bandit that takes contextual information into account. Instead of creating a separate MAB for each combination of characteristics, the contextual bandit uses “function approximation,” which tries to model the performance of each solution based on a set of input factors.

Without going too much into the details (that could be the subject of another post), our contextual bandit uses supervised machine learning to predict the performance of each ad based on location, device type, gender, age, etc. The benefit of the contextual bandit is that it uses one machine learning model per ad instead of creating an MAB per combination of characteristics.

This wraps up our discussion of ad optimization with reinforcement learning. The same reinforcement learning techniques can be used to solve many other problems, such as content and product recommendation or dynamic pricing, and are used in other domains such as health care, investment, and network management.

Why LLMs should stop thinking out loud (and what comes after…

Beyond vibe coding: How Codev 3.0 engineers the AI-powered dev team

How Cursor’s Composer 2.5 uses self-distillation to beat the frontier LLMs…

Vertical integration as AI infrastructure: What 21D’s full arch implant system…

Why sandboxing OpenClaw doesn’t stop data exfiltration

Applied ML: When ‘perfect’ becomes the enemy of ‘good’

AI can’t replace software engineers yet, but here is how to…

How to turbocharge your product and market research with DeepSearch

How looking differently at data can save your machine learning project

Building a solid data foundation for generative AI applications

Why the future of agentic AI is all about the harness

The evolution of LLM tool-use from API calls to agentic applications

What makes DeepSeek-V3.2 so efficient?

What to know about Claude Opus 4.5

OpenAI’s GPT-5: A reality check for the AI hype train

AI is writing your code, but who’s reviewing it?

Machine learning in space: Building intelligent systems for the harshest environments

Decoding the brain, inspiring AI: How Rahul Biswas is bridging neuroscience…

The cash flow conundrum: How technology is reshaping small business finance

What to know about the security of open-source machine learning models

How reinforcement learning chooses the ads you see

Naïve A/B/n testing

The multi-armed bandit

Exploration vs exploitation

Improving the ε-greedy algorithm

Adding context to the reinforcement learning model

Like this:

1 COMMENT

Leave a ReplyCancel reply

Naïve A/B/n testing

The multi-armed bandit

Exploration vs exploitation

Improving the ε-greedy algorithm

Adding context to the reinforcement learning model

Like this:

1 COMMENT

Leave a ReplyCancel reply

Discover more from TechTalks