Why computer vision algorithms need new benchmarks


This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

In recent years, the creation of large datasets of labeled images has helped drive the development of highly accurate computer vision systems. Artificial intelligence models trained and tested on repositories like ImageNet (approx. 14 million photos) and OpenImages (approx. 9 million photos) can match—and sometimes exceed—human performance at detecting specific classes of objects.

However, these datasets also have a fundamental problem: They mostly contain images that have been taken in ideal conditions. As a result, the deep learning models trained on them are biased toward those characteristics and are poor at handling the messiness of the real world.

In a paper presented at NeurIPS 2019, researchers at the MIT-IBM Watson AI Lab and MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) explore the problems with current computer vision datasets. The researchers propose ObjectNet, a new image repository carefully assembled to avoid the biases found in popular image datasets and to reflect the realities AI algorithms face in the real world.

Even the best object detectors suffer a considerable performance drop when tested on ObjectNet, which highlights the need for more rigorous ways to evaluate the robustness and accuracy of computer vision—and by extension machine learning—systems.

What is wrong with ImageNet?

AI researchers usually collect their datasets from repositories of existing images such as Flickr. The problem is that the photographers did not take their photos with AI training in mind; they primarily wanted to post nice pictures on the web.

“ImageNet and other large image and video datasets are typically created by taking images off the web. The problem is that images people tend to upload are in a sense ‘cleaner’ than real-world visual content that robots or autonomous cars may encounter,” Dan Gutfreund, Research Staff Member and Principal Investigator at MIT-IBM Watson AI Lab, told TechTalks in written comments.

Gutfreund coauthored the ObjectNet paper with research scientists Boris Katz and Andrei Barbu, and their students from MIT CSAIL. Also on the team was MIT cognitive science professor Joshua Tenenbaum, a distinguished researcher who is renowned for his research on the human mind and hybrid AI systems.

For example, people typically upload pictures of their dogs either taken from the front or the side but much less so from behind. Also, the dog will typically be in full view and not partially occluded behind another object. And finally, the dog will be in its typical surroundings, for example, sleeping on the couch or playing in the park.

“You don’t tend to see dogs standing on kitchen tables,” Gutfreund says.

The deep learning models trained on these datasets adopt the biases the repositories contain. Consequently, they show very high accuracy on the data contained in those and similar repositories. In some cases, the AI models even outperform humans at those specific tasks. But they perform poorly on datasets that don’t contain those biases.

“Once you remove the biases, for example, by showing partially occluded objects out of their context, then these models fail miserably, while humans don’t. As a result, it is hard to predict how these models will perform in the real world,” Gutfreund says.

ImageNet images vs ObjectNet images
ImageNet vs reality: In ImageNet (left column) objects are neatly positioned, in ideal background and lighting conditions. In the real world, things are messier (source: objectnet.dev)

Deep learning models trained on biased datasets become unstable in the real world, where lighting conditions, backgrounds, rotations, occlusions, and other factors are not consistent. In some environments, transfer learning, the practice of fine-tuning a pre-trained AI model with another dataset, can help improve the performance of deep learning models introduced to new problems. But fine-tuning and increasing dataset size can’t address the complexities of the real world, where every object can appear in virtually countless positions and situations.
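The transfer learning described above—freezing a pretrained feature extractor and retraining only a small head on the new dataset—can be illustrated with a minimal sketch. The "pretrained" extractor below is a stand-in (a fixed random projection); in practice it would be a network pretrained on something like ImageNet. The data and dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained feature extractor:
# a fixed random projection followed by a nonlinearity.
W_frozen = rng.normal(size=(8, 16))

def extract_features(x):
    # "Pretrained" layers are frozen: W_frozen is never updated.
    return np.tanh(x @ W_frozen)

# A small synthetic "new task" dataset (two classes, linearly labeled).
X = rng.normal(size=(200, 8)) * 0.3
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tune only a new linear head on top of the frozen features.
feats = extract_features(X)
w = np.zeros(16)
b = 0.0
lr = 0.5
for _ in range(500):
    logits = feats @ w + b
    p = 1 / (1 + np.exp(-logits))     # sigmoid
    grad = p - y                      # logistic-loss gradient
    w -= lr * feats.T @ grad / len(y)
    b -= lr * grad.mean()

preds = (feats @ w + b > 0).astype(float)
accuracy = (preds == y).mean()
print(f"head-only fine-tuning accuracy: {accuracy:.2f}")
```

The point of the sketch is the division of labor: only `w` and `b` are trained, while the feature extractor stays fixed—which is exactly why such fine-tuning cannot correct biases baked into the pretrained features.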

Why is this important? Because computer vision algorithms are becoming increasingly important in sensitive fields such as health care, autonomous driving, and identity verification. These are fields where lack of robustness in AI models can have fatal consequences.

There is a growing interest in finding ways to create AI systems that replicate the human vision system, which doesn’t need a ton of training data. Humans can also fill the gaps when they face situations they haven’t experienced before.

“Deep learning models are very good at identifying recurrent patterns in large data. However, they do not generalize well to rarely occurring or unseen scenarios, e.g., out of distribution examples,” Gutfreund says. “Humans, on the other hand, are very good at generalizing to new settings. A person seeing for the first time a black swan after only observing white swans will immediately realize that it is still a swan. It seems like our brains have abstraction capacity that DL models do not have.”

ObjectNet: A new way to test computer vision algorithms

The researchers at MIT-IBM Watson AI Lab developed ObjectNet with the goal of removing the biases that exist in current image datasets. Instead of curating photos from existing sources on the web, the researchers crowdsourced the photos on Mechanical Turk, Amazon’s micro-task platform.

The Turkers took individual pictures of the objects and sent them in for review. The process ensured enough variety in background, lighting, rotation, and other factors. ObjectNet contains 50,000 images, distributed across 313 object classes.
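A review process like the one described—checking that each object class is photographed under enough varied conditions—can be sketched as a simple coverage check. The factor names and the full-coverage requirement below are illustrative assumptions, not ObjectNet's actual protocol.

```python
from itertools import product

# Illustrative controlled factors (assumed, not ObjectNet's exact spec).
BACKGROUNDS = ["kitchen", "bedroom", "living_room"]
ROTATIONS = ["upright", "rotated_90", "upside_down"]
VIEWPOINTS = ["front", "side", "top"]
REQUIRED = set(product(BACKGROUNDS, ROTATIONS, VIEWPOINTS))

def missing_combinations(photos):
    """Return the (background, rotation, viewpoint) combos not yet covered."""
    covered = {(p["background"], p["rotation"], p["viewpoint"]) for p in photos}
    return REQUIRED - covered

# Example: a class with only two submitted photos is far from full coverage.
photos = [
    {"background": "kitchen", "rotation": "upright", "viewpoint": "front"},
    {"background": "bedroom", "rotation": "rotated_90", "viewpoint": "top"},
]
gaps = missing_combinations(photos)
print(f"{len(gaps)} of {len(REQUIRED)} required combinations still missing")
```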

In some ways, ObjectNet is easier than other datasets: the objects are centered in the image, the backgrounds are not necessarily cluttered, and the objects are only rarely and lightly occluded. But in other ways, ObjectNet poses new challenges, introducing rotations and viewpoints not usually seen in other AI training datasets. While humans generally have no problem identifying the objects in ObjectNet, in rare cases they might find them confusing.

“This demonstrates a much wider range of difficulty and provides an opportunity to also test the limits of human object recognition – if object detectors are to augment or replace humans, such knowledge is critical,” the researchers note in their paper. “Our overall goal is to test the bias of detectors and their ability to generalize to specific manipulations, not to just create images that are difficult for arbitrary reasons.”

How existing deep learning models perform on ObjectNet

Unlike other datasets, ObjectNet only includes a test dataset. It does not come paired with training data. “Separating training and test set collection may be an important tool to avoid correlations between the two which are easily accessible to large models but not detectable by humans,” the MIT and IBM researchers write. “Since humans easily generalize to new datasets, adopting this separation can encourage new machine learning techniques that do the same.”

ObjectNet also disallows fine-tuning on its images, to prevent AI models from overfitting to the data contained in the dataset.
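The test-only evaluation protocol this implies can be sketched as follows: an ImageNet-trained model is scored on ObjectNet without any parameter updates, and only on classes that have an ImageNet counterpart. The tiny label map and "model" below are made up for illustration; they are not ObjectNet's actual mapping.

```python
# Hypothetical ObjectNet-class -> ImageNet-class mapping (illustrative only).
OBJECTNET_TO_IMAGENET = {
    "chair": "folding chair",
    "teapot": "teapot",
    "banana": "banana",
    # "dish_soap": no ImageNet counterpart, so it is excluded from scoring
}

def objectnet_accuracy(samples, predict):
    """Score predictions only on classes with an ImageNet counterpart.

    The model is treated as frozen: `predict` is only ever called,
    never trained, which mirrors the no-fine-tuning rule.
    """
    scored = [s for s in samples if s["label"] in OBJECTNET_TO_IMAGENET]
    correct = sum(
        predict(s["image"]) == OBJECTNET_TO_IMAGENET[s["label"]]
        for s in scored
    )
    return correct / len(scored)

# Fake samples and fake frozen-model predictions, for demonstration only.
samples = [
    {"image": "img0", "label": "chair"},
    {"image": "img1", "label": "teapot"},
    {"image": "img2", "label": "banana"},
    {"image": "img3", "label": "dish_soap"},  # excluded: no mapping
]
fake_predictions = {"img0": "folding chair", "img1": "coffeepot", "img2": "banana"}
acc = objectnet_accuracy(samples, lambda img: fake_predictions.get(img))
print(f"accuracy on overlapping classes: {acc:.2f}")
```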

When the researchers tested popular image classifiers such as AlexNet, ResNet, and Inception on ObjectNet, they saw a 40-45 percent drop in performance.

Top Image classifiers performance on ObjectNet
Top image classifiers see a 40-45 percent performance drop when tested against ObjectNet

The researchers observed "a large performance gap depending on the background (15%), rotation (20%) and viewpoint (15%)." And this happened even though the researchers did not specify where in a room to pose an object or how cluttered the background should be when taking pictures.
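Per-factor gaps like the ones quoted above can be measured by grouping test images by each controlled factor and comparing accuracies across groups, roughly as sketched below. The records and field names are toy assumptions, not the paper's data.

```python
from collections import defaultdict

# Toy per-image evaluation records; values are illustrative.
records = [
    {"rotation": "upright", "correct": True},
    {"rotation": "upright", "correct": True},
    {"rotation": "upright", "correct": False},
    {"rotation": "upside_down", "correct": False},
    {"rotation": "upside_down", "correct": True},
    {"rotation": "upside_down", "correct": False},
]

def accuracy_by_factor(records, factor):
    """Compute accuracy within each value of one controlled factor."""
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append(r["correct"])
    return {value: sum(flags) / len(flags) for value, flags in groups.items()}

by_rotation = accuracy_by_factor(records, "rotation")
# Performance gap for this factor: best group minus worst group.
gap = max(by_rotation.values()) - min(by_rotation.values())
print(f"accuracy by rotation: {by_rotation}, gap: {gap:.2f}")
```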

Even fine-tuning the AI models yielded very limited performance improvement. Humans, meanwhile, scored 95 percent on the dataset.

“ObjectNet is challenging because of the intersection of real-world images and controls. It pushes object detectors beyond the conditions they can generalize to today,” the researchers write.

The work done on ObjectNet also sheds light on the limits of approaches that try to improve AI models simply by throwing more data at them.

“More data improves results but the benefits eventually saturate. The expected performance of many object recognition applications is much lower than traditional datasets indicate,” the researchers write.

Larger architectural changes to object detectors that directly address phenomena like those controlled for in ObjectNet “would be beneficial and may provide the next large performance increase,” the researchers observe. “ObjectNet can serve as a means to demonstrate this robustness which would not be seen in standard benchmarks.”

“We need to develop mechanisms for abstraction and reasoning,” Gutfreund told TechTalks. “Attention mechanisms, which learn which parts of the images are the most important ones (and thus ignore the unimportant details in the background) are also important and have been a topic of extensive research over the last few years. Transfer and zero-shot learning techniques, which directly address the out of distribution problem, have also been extensively studied recently.”

The challenges of collecting quality AI training data

ObjectNet also highlights the challenges of obtaining the right data for training machine learning models. The AI researchers employed 6,000 Mechanical Turkers to collect nearly 100,000 photos. The workers had to use a mobile app that guided them through the steps of positioning the items and taking the pictures.

ObjectNet image collection app process
Mechanical Turk workers who took part in collecting images for ObjectNet had to go through an exhaustive process to make sure their photos were conformant with the test’s standards.

The researchers then manually examined each of those photos, discarding those that did not conform with the guidelines, such as those with incorrect backgrounds, visible faces, or private information. The entire process took nearly four years.

“Most of the time was spent on researching, developing and refining the data collection mechanism,” Gutfreund says. “Now that we have a working protocol, the data collection can be done fairly quickly.”

As a next iteration, the researchers are thinking about finding ways to remove other types of biases, such as showing objects that are partially occluded. “It is very challenging to do that in a systematic and controlled manner. So, it will take more time to refine the mechanism to address this type of bias. But, we’re hopeful and at IBM, we are looking to achieve AI results in 2020 more broadly that can achieve more while using less (e.g., data, time, and other resources),” Gutfreund says.
