This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
Machine learning has proven to be very efficient at classifying images and other unstructured data, a task that is very difficult to handle with classic rule-based software. But before machine learning models can perform classification tasks, they need to be trained on a lot of annotated examples. Data annotation is a slow and manual process that requires humans reviewing training examples one by one and giving them their right label.
In fact, data annotation is such a vital part of machine learning that the growing popularity of the technology has given rise to a huge market for labeled data. From Amazon’s Mechanical Turk to startups such as LabelBox, ScaleAI, and Samasource, there are dozens of platforms and companies whose job is to annotate data to train machine learning systems.
Fortunately, for some classification tasks, you don’t need to label all your training examples. Instead, you can use semi-supervised learning, a machine learning technique that can automate the data-labeling process with a bit of help.
Supervised vs unsupervised vs semi-supervised machine learning
You only need labeled examples for supervised machine learning tasks, where you must specify the ground truth for your AI model during training. Examples of supervised learning tasks include image classification, facial recognition, sales forecasting, customer churn prediction, and spam detection.
Unsupervised learning, on the other hand, deals with situations where you don’t know the ground truth and want to use machine learning models to find relevant patterns. Examples of unsupervised learning include customer segmentation, anomaly detection in network traffic, and content recommendation.
Semi-supervised learning stands somewhere between the two. It solves classification problems, which means you’ll ultimately need a supervised learning algorithm for the task. But at the same time, you want to train your model without labeling every single training example, for which you’ll get help from unsupervised machine learning techniques.
Semi-supervised learning with clustering and classification algorithms
One way to do semi-supervised learning is to combine clustering and classification algorithms. Clustering algorithms are unsupervised machine learning techniques that group data together based on their similarities. The clustering model will help us find the most relevant samples in our data set. We can then label those and use them to train our supervised machine learning model for the classification task.
Say we want to train a machine learning model to classify handwritten digits, but all we have is a large data set of unlabeled images of digits. Annotating every example is out of the question and we want to use semi-supervised learning to create your AI model.
First, we use k-means clustering to group our samples. K-means is a fast and efficient unsupervised learning algorithm, which means it doesn’t require any labels. K-means calculates the similarity between our samples by measuring the distance between their features. In the case of our handwritten digits, every pixel will be considered a feature, so a 20×20-pixel image will be composed of 400 features.
When training the k-means model, you must specify how many clusters you want to divide your data into. Naturally, since we’re dealing with digits, our first impulse might be to choose ten clusters for our model. But bear in mind that some digits can be drawn in different ways. For instance, here are different ways you can draw the digits 4, 7, and 2. You can also think of various ways to draw 1, 3, and 9.
Therefore, in general, the number of clusters you choose for the k-means machine learning model should be greater than the number of classes. In our case, we’ll choose 50 clusters, which should be enough to cover different ways digits are drawn.
After training the k-means model, our data will be divided into 50 clusters. Each cluster in a k-means model has a centroid, a set of values that represent the average of all features in that cluster. We choose the most representative image in each cluster, which happens to be the one closest to the centroid. This leaves us with 50 images of handwritten digits.
Now, we can label these 50 images and use them to train our second machine learning model, the classifier, which can be a logistic regression model, an artificial neural network, a support vector machine, a decision tree, or any other kind of supervised learning engine.
Training a machine learning model on 50 examples instead of thousands of images might sound like a terrible idea. But since the k-means model chose the 50 images that were most representative of the distributions of our training data set, the result of the machine learning model will be remarkable. In fact, the above example, which was adapted from the excellent book Hands-on Machine Learning with Scikit-Learn, Keras, and Tensorflow, shows that training a regression model on only 50 samples selected by the clustering algorithm results in a 92-percent accuracy (you can find the implementation in Python in this Jupyter Notebook). In contrast, training the model on 50 randomly selected samples results in 80-85-percent accuracy.
But we can still get more out of our semi-supervised learning system. After we label the representative samples of each cluster, we can propagate the same label to other samples in the same cluster. Using this method, we can annotate thousands of training examples with a few lines of code. This will further improve the performance of our machine learning model.
Other semi-supervised machine learning techniques
There are other ways to do semi-supervised learning, including semi-supervised support vector machines (S3VM), a technique introduced at the 1998 NIPS conference. S3VM is a complicated technique and beyond the scope of this article. But the general idea is simple and not very different from what we just saw: You have a training data set composed of labeled and unlabeled samples. S3VM uses the information from the labeled data set to calculate the class of the unlabeled data, and then uses this new information to further refine the training data set.
If you’re are interested in semi-supervised support vector machines, see the original paper and read Chapter 7 of Machine Learning Algorithms, which explores different variations of support vector machines (an implementation of S3VM in Python can be found here).
An alternative approach is to train a machine learning model on the labeled portion of your data set, then using the same model to generate labels for the unlabeled portion of your data set. You can then use the complete data set to train an new model.
The limits of semi-supervised machine learning
Semi-supervised learning is not applicable to all supervised learning tasks. As in the case of the handwritten digits, your classes should be able to be separated through clustering techniques. Alternatively, as in S3VM, you must have enough labeled examples, and those examples must cover a fair represent the data generation process of the problem space.
But when the problem is complicated and your labeled data are not representative of the entire distribution, semi-supervised learning will not help. For instance, if you want to classify color images of objects that look different from various angles, then semi-supervised learning might help much unless you have a good deal of labeled data (but if you already have a large volume of labeled data, then why use semi-supervised learning?). Unfortunately, many real-world applications fall in the latter category, which is why data labeling jobs won’t go away any time soon.
But semi-supervised learning still has plenty of uses in areas such as simple image classification and document classification tasks where automating the data-labeling process is possible.
Semi-supervised learning is a brilliant technique that can come handy if you know when to use it.
In partnership with Paperspace.