This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
The principle of “the wisdom of the crowd” shows that a large group of people with average knowledge of a topic can collectively provide reliable answers to questions that involve estimating quantities, spatial reasoning, and general knowledge. The aggregate results cancel out the noise and can often be superior to those of highly knowledgeable experts. The same rule can apply to artificial intelligence applications that rely on machine learning, the branch of AI that predicts outcomes based on mathematical models.
In machine learning, crowd wisdom is achieved through ensemble learning. For many problems, the result obtained from an ensemble, a combination of machine learning models, can be more accurate than any single member of the group.
How does ensemble learning work?
Say you want to develop a machine learning model that predicts inventory stock orders for your company based on historical data you have gathered from previous years. You train four machine learning models using different algorithms: linear regression, a support vector machine, a regression decision tree, and a basic artificial neural network. But even after much tweaking and configuration, none of them achieves your desired 95 percent prediction accuracy. These machine learning models are called “weak learners” because they fail to converge to the desired level.
But weak doesn’t mean useless. You can combine them into an ensemble. For each new prediction, you run your input data through all four models and then average the results. When you examine the aggregated results, you find they provide 96 percent accuracy, which is more than acceptable.
The reason ensemble learning is effective is that your machine learning models work differently. Each model might perform well on some data and less accurately on others. When you combine all of them, they cancel out each other’s weaknesses.
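As a rough illustration, here is a minimal Python sketch of such an averaging ensemble using scikit-learn. The synthetic dataset, the specific model configurations, and the train/test split are all illustrative assumptions, not a prescription.

```python
# A minimal sketch of an averaging ensemble built from four different algorithms.
# The synthetic regression data stands in for historical inventory records.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1_000, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Four "weak learners", each trained on the same data with a different algorithm
# (feature scaling is omitted here for brevity)
models = [
    LinearRegression(),
    SVR(),
    DecisionTreeRegressor(),
    MLPRegressor(max_iter=1_000),
]
for model in models:
    model.fit(X_train, y_train)

# The ensemble prediction is simply the average of the four individual predictions
ensemble_pred = np.mean([model.predict(X_test) for model in models], axis=0)
```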
You can apply ensemble methods both to prediction problems, like the inventory example we just saw, and to classification problems, such as determining whether a picture contains a certain object.
For a machine learning ensemble, you must make sure your models are independent of each other (or as independent of each other as possible). One way to do this is to create your ensemble from different algorithms, as in the above example.
Another ensemble method is to use several instances of the same machine learning algorithm and train them on different data sets. For instance, you can create an ensemble composed of 12 linear regression models, each trained on a subset of your training data.
There are two key methods for sampling data from your training set. “Bootstrap aggregation,” aka “bagging,” takes random samples from the training set “with replacement.” The other method, “pasting,” draws samples “without replacement.”
To understand the difference between the sampling methods, here’s an example. Say you have a training set with 10,000 samples and you want to train each machine learning model in your ensemble with 9,000 samples. If you’re using bagging, for each of your machine learning models, you take the following steps:
- Draw a random sample from the training set.
- Add a copy of the sample to the model’s training set.
- Return the sample to the original training set.
- Repeat the process 8,999 times.
When using pasting, you go through the same process, with the difference that samples are not returned to the training set after being drawn. Consequently, the same sample might appear in a model’s training set several times when using bagging but only once when using pasting.
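In practice you rarely write this sampling loop by hand. As a sketch, scikit-learn’s BaggingRegressor can build such an ensemble directly, and switching between bagging and pasting comes down to one flag; the 12 linear regression models and 9,000-sample subsets below simply mirror the numbers used above, with synthetic data standing in for the training set.

```python
# A sketch of bagging vs. pasting using scikit-learn's BaggingRegressor,
# with 12 linear regression models as in the example above.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a 10,000-sample training set
X, y = make_regression(n_samples=10_000, n_features=10, noise=10, random_state=0)

# Bagging: each model is trained on 9,000 samples drawn WITH replacement
bagging = BaggingRegressor(LinearRegression(), n_estimators=12,
                           max_samples=9_000, bootstrap=True)

# Pasting: same setup, but samples are drawn WITHOUT replacement
pasting = BaggingRegressor(LinearRegression(), n_estimators=12,
                           max_samples=9_000, bootstrap=False)

bagging.fit(X, y)
pasting.fit(X, y)
```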
After training all your machine learning models, you’ll have to choose an aggregation method. If you’re tackling a classification problem, the usual aggregation method is the “statistical mode,” that is, the class predicted most often across the models. In regression problems, ensembles usually use the average of the predictions made by the models.
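As a toy illustration with made-up predictions (the numbers here are purely hypothetical), the two aggregation rules boil down to a majority vote and an average:

```python
# A toy sketch of the two aggregation rules: majority vote for
# classification and averaging for regression.
import numpy as np

# Hypothetical class predictions from three classifiers for one input
class_preds = np.array([1, 1, 0])
majority_class = np.bincount(class_preds).argmax()  # -> 1

# Hypothetical value predictions from three regressors for one input
reg_preds = np.array([102.0, 98.0, 105.0])
average_value = reg_preds.mean()  # -> roughly 101.7
```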
Another popular ensemble technique is “boosting.” In contrast to classic ensemble methods, where machine learning models are trained in parallel, boosting methods train them sequentially, with each new model building on the previous one and correcting its shortcomings.
AdaBoost (short for “adaptive boosting”), one of the more popular boosting methods, improves the accuracy of ensemble models by adapting new models to the mistakes of previous ones. After training your first machine learning model, you single out the training examples misclassified or wrongly predicted by the model. When training the next model, you put more emphasis on these examples. This results in a machine learning model that performs better where the previous one failed. The process repeats itself for as many models as you want to add to the ensemble. The final ensemble contains several machine learning models of different accuracies, which together can provide better accuracy. In boosted ensembles, the output of each model is given a weight that is proportional to its accuracy.
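Here is a minimal AdaBoost sketch in scikit-learn, assuming a classification task; the synthetic dataset and the number of estimators are illustrative choices, not recommendations.

```python
# A minimal AdaBoost sketch; by default scikit-learn boosts shallow
# decision trees ("stumps"), reweighting misclassified examples before
# each new tree is trained.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```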
One area where ensemble learning is very popular is decision trees, a machine learning algorithm that is very useful because of its flexibility and interpretability. Decision trees can make predictions on complex problems, and they can also trace back their outputs to a series of very clear steps.
The problem with decision trees is that they don’t create smooth boundaries between different classes unless you break them down into too many branches, in which case they become prone to “overfitting,” a problem that occurs when a machine learning model performs very well on training data but poorly on novel examples from the real world.
This is a problem that can be solved through ensemble learning. Random forests are machine learning ensembles composed of multiple decision trees (hence the name “forest”). Using random forests ensures that a machine learning model does not get caught up in the specific confines of a single decision tree.
Random forests have their own independent implementation in Python machine learning libraries such as scikit-learn.
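As a brief sketch (with a synthetic dataset and default hyperparameters, both illustrative), you can compare a single decision tree with a random forest of 100 trees; on most runs the forest generalizes noticeably better to the held-out data.

```python
# Comparing a single decision tree with a random forest on synthetic data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single decision tree:", tree.score(X_test, y_test))
print("random forest:       ", forest.score(X_test, y_test))
```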
Challenges of ensemble learning
While ensemble learning is a very powerful tool, it also has some tradeoffs.
Using ensemble learning means you must spend more time and resources on training your machine learning models. For instance, a random forest with 500 trees provides much better results than a single decision tree, but it also takes much more time to train. Running ensemble models can also become problematic if the algorithms you use require a lot of memory.
Another problem with ensemble learning is explainability. While adding new models to an ensemble can improve its overall accuracy, it makes it harder to investigate the decisions made by the AI algorithm. A single machine learning model such as a decision tree is easy to trace, but when you have hundreds of models contributing to an output, it is much more difficult to make sense of the logic behind each decision.
As with most everything you’ll encounter in machine learning, ensemble learning is one of many tools you have for solving complicated problems. It can get you out of difficult situations, but it’s not a silver bullet. Use it wisely.