How to teach AI to reason about videos

May 4, 2020

video reel — Image credit: Depositphotos

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

Look at the short video below. Can you answer the following questions: Which object caused the ball to change direction? Where will the ball go next? What would happen if you removed the bat from the scene?

You might find these questions trivial. But interestingly, today’s most advanced artificial intelligence systems would struggle to answer them. Questions like these require the ability to reason about objects and their behaviors and relations over time. This is an integral component of human intelligence, but one that has eluded AI scientists for decades.

A new study presented at ICLR 2020 by researchers at IBM, MIT, Harvard, and DeepMind highlights the shortcomings of current AI systems in dealing with causality in videos. In their paper, the researchers introduce CLEVRER, a new dataset and benchmark to evaluate the capabilities of AI algorithms in reasoning about video sequences, and Neuro-Symbolic Dynamic Reasoning (NS-DR), a hybrid AI system that marks a substantial improvement on causal reasoning in controlled environments.

Why artificial intelligence can’t reason about videos

For us humans, detecting and reasoning about objects in a scene almost go hand in hand. But for current artificial intelligence technology, they’re two fundamentally different disciplines.

In the past years, deep learning has brought great advances to the field of artificial intelligence. Deep neural networks, the main component of deep learning algorithms, can find intricate patterns in large sets of data. This enables them to perform tasks that were previously off-limits or very difficult for computer software, such as detecting objects in images or recognizing speech.

It’s amazing what pattern recognition alone can achieve. Neural networks play an important role in many of the applications we use every day, from finding objects and scenes in Google Images to detecting and blocking inappropriate content on social media. Neural networks have also made some inroads in generating descriptions about videos and images.

But there are also very clear limits to how far you can push pattern recognition. While an important part of human vision, pattern recognition is only one of its many components. When our brain parses the baseball video at the beginning of this article, our knowledge of motion, object permanence, and solidity kicks in. Based on this knowledge, we can predict what will happen next (where the ball will go) and reason about counterfactual situations (what if the bat didn’t hit the ball). This is why even a person who has never seen baseball played before will have a lot to say about this video.

A deep learning algorithm, however, detects the objects in the scene because they are statistically similar to thousands of other objects it has seen during training. It knows nothing about material, gravity, motion, and impact, some of the concepts that allow us to reason about the scene.

Visual reasoning is an active area of research in artificial intelligence. Researchers have developed several datasets that evaluate AI systems’ ability to reason over video segments. Whether deep learning alone can solve the problem is an open question.

Some AI scientists believe that given enough data and compute power, deep learning models will eventually be able to overcome some of these challenges. But so far, progress in fields that require common sense and reasoning has been slow and incremental.

The CLEVRER dataset

The new dataset introduced at ICLR 2020 is named “CoLlision Events for Video REpresentation and Reasoning,” or CLEVRER. It is inspired by CLEVR, a visual question-answering dataset developed at Stanford University in 2017. CLEVR is a set of problems that present still images of solid objects. The AI agent must be able to parse the scene and answer multiple-choice questions about the number of objects, their attributes, and their spatial relationships.

CLEVR example — CLEVR is a visual question-answering dataset that tests the capabilities of AI systems in reasoning about the content of images. (Source: Stanford Computer Science)
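To make the task format concrete, here is a minimal sketch of a CLEVR-style scene held as structured data, with two simple questions answered against it. The attribute names, the sample scene, and the helper functions are illustrative assumptions, not the dataset's actual annotation schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneObject:
    # Field names are hypothetical, chosen only for this illustration.
    shape: str
    color: str
    material: str
    x: float  # horizontal position; a smaller value means further left

scene = [
    SceneObject("cube", "red", "rubber", 0.5),
    SceneObject("sphere", "blue", "metal", 2.0),
    SceneObject("cylinder", "gray", "metal", 3.5),
]

def count(objects, **attrs):
    """Count objects whose attributes match all the given values."""
    return sum(all(getattr(o, k) == v for k, v in attrs.items()) for o in objects)

def left_of(objects, anchor):
    """Return the objects positioned to the left of the anchor object."""
    return [o for o in objects if o.x < anchor.x]

# "How many metal objects are there?"
metal_count = count(scene, material="metal")

# "What shapes are to the left of the sphere?"
sphere = next(o for o in scene if o.shape == "sphere")
shapes_left_of_sphere = [o.shape for o in left_of(scene, sphere)]
```

Once the scene is in this form, answering is the easy part. The hard part for an AI system is producing such a structured description from raw pixels and mapping a free-form question onto the right query.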

CLEVRER consists of videos of solid objects moving and colliding with each other. AI agents are tested on their ability to answer descriptive, explanatory, predictive, and counterfactual questions about the scenes. For instance, in the scene below, the AI is asked questions such as the following:

Descriptive: What is the material of the last object to collide with the cylinder?
Explanatory: Does the collision between the rubber cylinder and the red rubber sphere cause the collision between the rubber and metal cylinder?
Predictive: Will the metal sphere and the gray cylinder collide?
Counterfactual: Will the red rubber sphere and the gray cylinder collide if we remove the cyan cylinder from the scene?
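One way to see why counterfactual questions need a model of dynamics is to answer one with a toy simulator. The sketch below is an assumption-laden illustration, not how CLEVRER videos are generated: it uses a one-dimensional world, equal-mass objects, perfectly elastic collisions, and hypothetical object names A, B, and C.

```python
def simulate(objects, steps=2000, dt=0.01, radius=0.5):
    """Return the ordered list of colliding pairs in a toy 1D world.

    `objects` maps a name to (position, velocity). All objects have equal
    mass, so an elastic collision simply swaps the two velocities.
    """
    pos = {name: p for name, (p, _) in objects.items()}
    vel = {name: v for name, (_, v) in objects.items()}
    names = sorted(pos, key=pos.get)  # left-to-right order is fixed in 1D
    collisions = []
    for _ in range(steps):
        for name in names:
            pos[name] += vel[name] * dt
        for a, b in zip(names, names[1:]):
            touching = pos[b] - pos[a] <= 2 * radius
            approaching = vel[a] > vel[b]
            if touching and approaching:
                vel[a], vel[b] = vel[b], vel[a]
                collisions.append((a, b))
    return collisions

def counterfactual(objects, removed):
    """Re-run the simulation with one object taken out of the scene."""
    return simulate({n: s for n, s in objects.items() if n != removed})

# A moves right; B and C start at rest further along the line.
world = {"A": (0.0, 1.0), "B": (5.0, 0.0), "C": (10.0, 0.0)}

observed = simulate(world)              # what happens in the "video"
without_b = counterfactual(world, "B")  # "Will A and C collide if we remove B?"
without_a = counterfactual(world, "A")  # "Did A hitting B cause B to hit C?"
```

In the observed run, A never touches C because it stops after hitting B. With B removed, A travels on and hits C. With A removed, nothing collides at all, which is the kind of evidence an explanatory question asks for. A pattern matcher that only sees the original video has no way to produce either alternative outcome.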

Like the questions asked about the video at the beginning of this article, these questions might sound trivial to you. But they are complicated tasks to accomplish with current blends of AI because they require a causal understanding of the scene.

As the authors of the paper summarize, solving CLEVRER problems requires three key elements: “recognition of the objects and events in the videos; modeling the dynamics and causal relations between the objects and events; and understanding of the symbolic logic behind the questions.”

“CLEVRER is a first visual reasoning dataset that is designed for causal reasoning in videos. Previous visual reasoning datasets mostly focus on factual questions, such as what, when, where, and is/are. But the most fundamental reasoning ability is to understand ‘why,’” Chuang Gan, research scientist at MIT-IBM Watson AI Lab and co-author of the CLEVRER paper, told TechTalks.

A controlled environment

CLEVRER is “a fully-controlled synthetic environment,” as per the authors of the paper. The types and materials of objects are few, all the problems are set on a flat surface, and the vocabulary used in the questions is limited. This detail is very important because current AI systems are very bad at handling open environments where the combination of events that can happen is unlimited.

The controlled environment has enabled the developers of CLEVRER to provide richly annotated examples to evaluate the performance of AI models. It allows AI researchers to focus their model development on complex reasoning tasks while removing other hurdles such as image recognition and language understanding.

But what it also implies is that if an AI model scores high on CLEVRER, it doesn’t necessarily mean that it will be able to handle the messiness of the real world where anything can happen. The model might work on other limited environments, however.

“The use of temporal and causal reasoning in videos could play an important role in robotic and automatic driving applications,” says Gan. “If there was a traffic accident, for example, the CLEVRER model could be used to analyze the surveillance videos and uncover what was responsible for the crash. In robotics application, it could also be useful if the robot can follow natural language command and take action accordingly.”

The Neuro-Symbolic Dynamic Reasoning AI model

The authors of the paper tested basic deep learning models on CLEVRER, such as convolutional neural networks (CNNs) combined with multilayer perceptrons (MLPs) and long short-term memory networks (LSTMs). They also tested variations of the advanced deep learning models TVQA, IEP, TbDNet, and MAC, each modified to better suit visual reasoning.

The basic deep learning models performed modestly on descriptive challenges and poorly on the rest. Some of the advanced models performed decently on descriptive challenges. But on the rest of the challenges, their accuracy dropped considerably. Pure neural network-based AI models lack understanding of causal and temporal relations between objects and their behavior. They also lack a model of the world that allows them to foresee what happens next and figure out how alternative counterfactual scenarios work.

As a solution, the researchers introduced the Neuro-Symbolic Dynamic Reasoning model, a combination of neural networks and symbolic artificial intelligence. Symbolic AI, also known as rule-based AI, has fallen by the wayside with the rise of deep learning. Unlike neural networks, symbolic AI systems are very bad at processing unstructured information such as visual data and written text. But on the other hand, rule-based systems are very good at symbolic reasoning and knowledge representation, an area that has been a historical pain point for machine learning algorithms.

NS-DR puts both neural networks and symbolic reasoning systems to good use:

A convolutional neural network extracts objects from images.
An LSTM processes the questions and converts them into program commands.
A propagation network learns the physical dynamics from the object data extracted by the CNN and predicts future object behavior.
Finally, a Python program brings together all the structured information obtained from the neural networks to compile the answer to the question.

NS-DR structure — The Neuro-Symbolic Dynamic Reasoning model puts together neural networks and symbolic artificial intelligence
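The division of labor described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the object table and event list stand in for what the neural modules would produce, and the operation names (`filter_objects`, `collisions_with`, `last`, `partner`, `query`) are hypothetical rather than taken from the paper's program vocabulary.

```python
# Stand-ins for the output of the neural video parser and dynamics predictor.
objects = {
    1: {"shape": "cylinder", "material": "rubber", "color": "cyan"},
    2: {"shape": "sphere", "material": "rubber", "color": "red"},
    3: {"shape": "cylinder", "material": "metal", "color": "gray"},
}
events = [
    {"type": "collision", "frame": 30, "objects": (1, 2)},
    {"type": "collision", "frame": 75, "objects": (2, 3)},
]

# Stand-in for the question parser's output for the question:
# "What is the material of the last object to collide with the red sphere?"
program = [
    ("filter_objects", {"color": "red", "shape": "sphere"}),
    ("collisions_with", None),
    ("last", None),
    ("partner", None),
    ("query", "material"),
]

def execute(program, objects, events):
    """Run a symbolic program step by step over the structured scene data."""
    value = None   # result of the most recent operation
    target = None  # id of the object the question is about
    for op, arg in program:
        if op == "filter_objects":
            matches = [i for i, o in objects.items()
                       if all(o[k] == v for k, v in arg.items())]
            target = value = matches[0]
        elif op == "collisions_with":
            value = [e for e in events
                     if e["type"] == "collision" and target in e["objects"]]
        elif op == "last":
            value = max(value, key=lambda e: e["frame"])
        elif op == "partner":
            value = next(i for i in value["objects"] if i != target)
        elif op == "query":
            value = objects[value][arg]
        else:
            raise ValueError(f"unknown op: {op}")
    return value

answer = execute(program, objects, events)
```

The executor itself contains no learned parameters. Every intermediate value can be inspected, which is part of what makes the symbolic half of such a system easy to debug and explain.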

The performance of NS-DR is considerably higher than that of pure deep learning models on explanatory, predictive, and counterfactual challenges. Its accuracy on counterfactual questions still stands at a modest 42 percent, however, which speaks to the challenges of developing AI that can understand the world as we do. But it is still a significant gain in comparison to the 25-percent accuracy of the best-performing baseline deep learning model.
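To put the two counterfactual figures side by side, the short calculation below uses only the 42 and 25 percent values quoted above; the variable names are illustrative.

```python
ns_dr_accuracy = 42        # percent, counterfactual questions
best_baseline_accuracy = 25  # percent, best-performing baseline

# Difference in percentage points.
absolute_gain = ns_dr_accuracy - best_baseline_accuracy

# Improvement relative to the baseline's own score.
relative_gain = absolute_gain / best_baseline_accuracy
```

That works out to a gain of 17 percentage points, or 68 percent relative to the baseline, while still leaving more than half of the counterfactual questions answered incorrectly.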

Another significant benefit of NS-DR is that it requires much less data in the training phase.

The results show that incorporating neural networks and symbolic programs in the same AI model can combine their strengths and overcome their weaknesses. “Symbolic representation provides a powerful common ground for vision, language, dynamics and causality,” the authors note, adding that symbolic programs empower the model to “explicitly capture the compositionality behind the video’s causal structure and the question logic.”

The benefits of NS-DR do come with some caveats. The data used to train the model requires extra annotations, which might be too energy-consuming and expensive in real-world applications.

A stepping stone toward more generalizable AI systems

“Truly intelligent AI should not only solve pattern recognition problems, like recognizing an object and their relation. More importantly, it should build a causal model about the world, which can be used to help explain and understand the physical world,” Gan says. “NS-DR is our preliminary attempt to approach this complex problem.”

Gan acknowledges that NS-DR has several limitations that keep it from extending to rich visual environments. But the AI researchers have concrete plans to improve the visual perception, dynamics, and language understanding modules to strengthen the model’s generalization capability.

CLEVRER is one of several efforts that aim to push research toward artificial general intelligence. Another remarkable work in the field is the Abstract Reasoning Corpus, which evaluates the ability of software to develop general solutions to problems with very few training examples.

“NS-DR is a stepping stone towards future practical applications,” Gan says. “We believe the toolkit we have (combining visual-perception, object-based planning, and neuro-symbolic RL) might be one of the promising approaches to make fundamental progress toward building more genuinely intelligent machines.”
