What is data augmentation?

November 27, 2021

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Machine learning models can perform wonderful things—if they have enough training data. Unfortunately, for many applications, access to quality data remains a barrier.

One solution to this problem is “data augmentation,” a technique that generates new training examples from existing ones. Data augmentation is a low-cost and effective method to improve the performance and accuracy of machine learning models in data-constrained environments.

Overfitting in machine learning models

When machine learning models are trained on limited examples, they tend to “overfit.” Overfitting happens when an ML model performs accurately on its training examples but fails to generalize to unseen data.
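The gap between training and held-out performance can be illustrated with a toy example. The sketch below is not taken from any particular ML workflow: it assumes NumPy, a made-up sine-wave dataset with ten noisy training points, and polynomial curve fitting as a stand-in for a model. The names `true_fn` and `mse` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The underlying signal the model is supposed to learn (an assumption for this demo).
    return np.sin(2 * np.pi * x)

# A small, noisy training set and a larger held-out set drawn from the same source.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, size=x_test.shape)

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the points (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-9 polynomial has ten coefficients, one per training point,
# so it can pass through every noisy training example.
overfit = np.polyfit(x_train, y_train, deg=9)
# A degree-3 polynomial has far less capacity and cannot memorize the noise.
simple = np.polyfit(x_train, y_train, deg=3)

train_mse_overfit = mse(overfit, x_train, y_train)
test_mse_overfit = mse(overfit, x_test, y_test)
train_mse_simple = mse(simple, x_train, y_train)
test_mse_simple = mse(simple, x_test, y_test)

print(f"degree 9: train MSE={train_mse_overfit:.6f}, test MSE={test_mse_overfit:.6f}")
print(f"degree 3: train MSE={train_mse_simple:.6f}, test MSE={test_mse_simple:.6f}")
```

The high-capacity fit reaches a near-zero error on the ten points it was trained on, while its error on unseen points is much larger. That gap is the signature of overfitting.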

There are several ways to avoid overfitting in machine learning, such as choosing different algorithms, modifying the model’s architecture, and adjusting hyperparameters. But ultimately, the main remedy for overfitting is adding more quality data to the training dataset.

For example, consider the convolutional neural network (CNN), a type of machine learning architecture that is especially good for image classification tasks. Without a large and diverse set of training examples, a CNN will end up misclassifying images in the real world. On the other hand, if a CNN is trained on images of objects from different angles and under different lighting conditions, it will become more robust at identifying them in the real world.

However, gathering extra training examples can be expensive, time-consuming, or sometimes impossible. This challenge becomes even more difficult in supervised learning applications where training examples must be labeled by human experts.

Data augmentation

One of the ways to increase the diversity of the training dataset is to create copies of the existing data and make small modifications to them. This is called “data augmentation.”

For example, say you have twenty images of ducks in your image classification dataset. By creating copies of your duck images and flipping them horizontally, you double the number of training examples for the “duck” class. You can use other transformations such as rotation, cropping, zooming, and translation. You can also combine transformations to further expand your collection of unique training examples.
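The duck example can be sketched in a few lines. This is a minimal illustration, assuming NumPy and images stored as arrays of shape (height, width, channels). The random arrays stand in for real photos, and the helper names (`flip_horizontal`, `rotate_90`, `center_crop`, `translate`) are hypothetical, not functions from any library.

```python
import numpy as np

def flip_horizontal(image):
    # Mirror the image left to right by reversing the width axis.
    return image[:, ::-1]

def rotate_90(image, times=1):
    # Rotate by multiples of 90 degrees in the (height, width) plane.
    return np.rot90(image, k=times, axes=(0, 1))

def center_crop(image, crop_h, crop_w):
    # Keep a crop_h x crop_w window from the middle of the image.
    h, w = image.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]

def translate(image, dy, dx):
    # Shift the image by (dy, dx) pixels and fill the vacated area with zeros.
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the content has been shifted completely out of the frame
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

# Twenty placeholder "duck" images (random pixels standing in for real data).
rng = np.random.default_rng(0)
images = rng.random((20, 32, 32, 3))

# Flipping every image and appending the copies doubles the class to forty examples.
flipped = np.stack([flip_horizontal(img) for img in images])
augmented = np.concatenate([images, flipped])
print(augmented.shape)

# Transformations can be combined, for example a flip followed by a small shift.
combined = translate(flip_horizontal(images[0]), dy=2, dx=-3)
```

Whether a given transformation is appropriate depends on the task: a flipped duck is still a duck, but a flipped digit or letter may no longer match its label.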

Data augmentation does not need to be limited to geometric manipulation. Adding noise, changing color settings, and other effects such as blur and sharpening filters can also help in repurposing existing training examples as new data.
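Non-geometric transformations follow the same pattern. The sketch below assumes NumPy and float images with pixel values in the range [0, 1]. The helpers `add_gaussian_noise`, `adjust_brightness`, and `box_blur` are hypothetical names for this illustration, and the box blur is only one simple choice of blur filter.

```python
import numpy as np

def add_gaussian_noise(image, std, rng):
    # Add zero-mean Gaussian noise, then clip back into the valid pixel range.
    noisy = image + rng.normal(0.0, std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def adjust_brightness(image, factor):
    # Scale all pixel values; factor > 1 brightens, factor < 1 darkens.
    return np.clip(image * factor, 0.0, 1.0)

def box_blur(image, k=3):
    # Average each pixel with its k x k neighborhood (k is assumed to be odd).
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # placeholder image with values in [0, 1]

noisy = add_gaussian_noise(image, std=0.05, rng=rng)
brighter = adjust_brightness(image, factor=1.2)
blurred = box_blur(image, k=3)
```

The noise level, brightness factor, and kernel size used here are arbitrary. In practice they are tuned so that the modified image still plausibly belongs to its original class.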

Figure: Examples of data augmentation

Data augmentation is especially useful for supervised learning because you already have the labels and don’t need to put in extra effort to annotate the new examples. Data augmentation is also useful for other classes of machine learning algorithms such as unsupervised learning, contrastive learning, and generative models.

Data augmentation has become a standard practice for training machine learning models for computer vision applications. Popular machine learning and deep learning programming libraries have easy-to-use functions to integrate data augmentation into the ML training pipeline.

Data augmentation is not limited to images and can be applied to other types of data. For text datasets, nouns and verbs can be replaced with their synonyms. In audio data, training examples can be modified by adding noise or changing the playback speed.
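Both ideas can be sketched briefly. The example below is an illustration under stated assumptions, not a production recipe: `SYNONYMS` is a tiny made-up lookup table (a real system would use a thesaurus or language model), the audio clip is a synthetic sine tone, and the speed change is done by naive linear resampling with NumPy, which also shifts the pitch.

```python
import random
import numpy as np

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "big": ["large", "huge"],
    "runs": ["sprints", "jogs"],
}

def replace_synonyms(sentence, rng, prob=0.5):
    # Replace each word that has a known synonym with probability `prob`.
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def add_noise(wave, noise_std, rng):
    # Mix zero-mean Gaussian noise into the waveform.
    return wave + rng.normal(0.0, noise_std, size=wave.shape)

def change_speed(wave, rate):
    # Resample by linear interpolation; rate > 1 plays faster (shorter output).
    n_out = int(round(len(wave) / rate))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

# Text: generate a variant of one sentence.
text_rng = random.Random(0)
print(replace_synonyms("The quick dog runs to the big lake", text_rng))

# Audio: one second of a 440 Hz tone sampled at 16 kHz as a stand-in for a real clip.
audio_rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
wave = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(wave, noise_std=0.01, rng=audio_rng)
faster = change_speed(wave, rate=2.0)
slower = change_speed(wave, rate=0.5)
```

As with images, the label is carried over unchanged, so the transformation must not alter the meaning of the sentence or the content of the recording.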

Limits of data augmentation

Data augmentation is not a silver bullet for all your data problems. Think of it as a low-cost performance booster for your ML models. Depending on your target application, you still need a fairly large training dataset with enough examples.

In some applications, training data might be too limited for data augmentation to help. In these cases, you must collect more data until you reach a minimum threshold before you can use data augmentation. Sometimes, you can use transfer learning, where you train an ML model on a general dataset (e.g., ImageNet) and then repurpose it by finetuning its higher layers on the limited data you have for your target application.

Data augmentation also doesn’t address other problems such as biases that exist in the training dataset. The data augmentation process also needs to be adjusted to address other potential problems, such as class imbalance.
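One possible way to adjust the process for class imbalance is to generate augmented copies only for the under-represented classes. The sketch below assumes NumPy and image batches of shape (count, height, width, channels). `balance_with_flips` is a hypothetical helper, and horizontal flipping is used only because it is the simplest transformation. With a large deficit, a single transformation will produce repeated copies, so a real pipeline would draw from a wider set of random transformations.

```python
import numpy as np

def balance_with_flips(images, labels, rng):
    # Add flipped copies of minority-class images until every class
    # has as many examples as the largest class.
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    new_images, new_labels = [images], [labels]
    for cls, count in zip(classes, counts):
        deficit = int(target - count)
        if deficit == 0:
            continue
        # Sample source images from this class (with replacement) and flip them.
        idx = rng.choice(np.flatnonzero(labels == cls), size=deficit, replace=True)
        copies = images[idx][:, :, ::-1]  # reverse the width axis of each image
        new_images.append(copies)
        new_labels.append(np.full(deficit, cls, dtype=labels.dtype))
    return np.concatenate(new_images), np.concatenate(new_labels)

# Placeholder dataset: 30 examples of class 0 and 10 examples of class 1.
rng = np.random.default_rng(0)
images = rng.random((40, 8, 8, 3))
labels = np.array([0] * 30 + [1] * 10)

balanced_images, balanced_labels = balance_with_flips(images, labels, rng)
print(np.unique(balanced_labels, return_counts=True))
```

Balancing the counts this way does not remove biases that are already present in the source images. It only changes how often each class is seen during training.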

Used wisely, data augmentation can be a powerful tool in the machine learning engineer’s toolbox.

1 COMMENT

quarks4569 November 27, 2021 at 7:06 pm

Data Augmentation is a practical technique which can be included in the work flow process but in the instance that the required data needs are met.
