
An AI algorithm passed a science test. Here’s what you should know.

September 9, 2019

Image credit: Depositphotos

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Last week, the Allen Institute for Artificial Intelligence (AI2) introduced Aristo, an artificial intelligence model that scored above 90 percent on an 8th-grade science test and 80 percent on a 12th-grade exam.

Passing a science test might sound mundane if you’re not familiar with how deep learning algorithms, the current bleeding edge of AI, work. After all, AI already performs tasks such as diagnosing cancer, detecting fraud and playing complicated games, which seem much harder than answering simple science questions about the moon and squirrel populations.

But despite its fascinating achievements, deep learning struggles when it comes to tackling problems that require reasoning and common sense. Finding the subtle and implicit meanings of written and spoken language is especially difficult for contemporary AI algorithms, which are better tuned to searching for and finding patterns in data.

In this regard, Aristo’s achievement marks a considerable improvement over previous AI models, which scored no more than 60 percent on the same science test.

But do Aristo’s capabilities imply AI has learned to reason like humans? There seems to be some confusion on this.

As with every advance in AI, Aristo’s achievement was met with much hype and sensation. The headline of Fast Company’s coverage of Aristo reads, “This AI just passed a science test and may be smarter than an eighth grader.” The New York Times described Aristo as “A Breakthrough for A.I. Technology.” The Week’s story is titled, “Artificial intelligence is now as smart as an 8th grader.”

Interestingly, the creators of Aristo don’t make such claims.

Why AI struggles to answer questions

Neural networks, the core component of deep learning algorithms, are complex mathematical functions that are especially good at classifying information. A trained neural network can perform complicated tasks such as recognizing faces, labeling images and converting speech to text.

Neural networks and deep learning have also made inroads in natural language processing and understanding, the branch of AI that deals with finding relevant information and correlations in unstructured text. Deep learning–based language models can perform a wide range of tasks such as translation, question-answering and auto-complete.

But again, deep learning–based language models still rely on statistics and the comparison of word sequences to complete their tasks. And the problem is, much of the meaning of written text is inferred from abstract concepts and knowledge not explicitly mentioned in the words.

AI language models are known to be good at answering questions when they can find the explicit answer in their corpus of text, and notoriously bad when they need to reason and use common sense.

But as the authors of Aristo’s paper note, science tests “explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge.” This might make them a decent benchmark to see whether an AI model can extract meaning beyond what’s explicitly mentioned in text.

For instance, consider the following question: “Which substance is usually found in nature as a liquid, solid, and gas?”

To be able to answer this question, you must know many things, including the properties of liquids, solids and gases, and what it means to be “found in nature” (as opposed to being created artificially). Finally, you must have a notion of what a substance is.

A classic AI language model will be able to answer that question if it can find a sentence that is an explicit answer. In fact, if you type it into Google Search, which uses AI to answer queries, you’ll find something like the following: “Water is the only common substance that is naturally found as a solid, liquid or gas.” This is an excerpt from one of the pages the search engine has indexed.

An example of an efficient question-answering AI system was IBM Watson, which defeated human champions at Jeopardy! However, what’s interesting is that most of the questions asked in the game are simple facts that have explicit answers in encyclopedias such as Wikipedia. This means a decent language model with enough horsepower would be able to find their answers.

Narrowing down the problem domain to suit narrow AI

The goal of the Aristo project is to continue along the trajectory of previous work done in natural language processing, but “also answer questions where the answer may not be written down explicitly,” the authors write.

But the kind of general thinking and problem-solving we discussed in the previous section is beyond the current narrow AI technologies, which are better suited for limited and closed domains. Therefore, the designers of Aristo modified the science exams to make it easier for the AI to answer the questions.

As the authors mention, Aristo has been designed to only answer “non-diagram, multiple choice (NDMC) questions.” There’s a lot of significance in this little detail.

First, diagrams and images present many challenges to AI algorithms and require them to perform complicated computer vision tasks. Deep learning algorithms are pretty good at classifying images and detecting objects in photos, but when it comes to understanding the meaning of images and answering questions about them, they still have a long way to go.

Also, limiting the problem to answering multiple-choice questions simplifies the problem considerably. Multiple-choice questions provide the AI with possible answers, which makes it easier to search and compare those answers to its corpus of knowledge. In contrast, asking the AI to provide direct answers to questions instead of choosing from existing answers is a much more complicated problem, and creating coherent text that explains a scientific concept makes the problem even harder.

As Aristo’s creators explain, “[Questions that require direct answers] are complex, often requiring explanation and synthesis.”

Another interesting detail mentioned in the paper is this: “In the occasional case where two questions share the same preamble, the preamble is repeated for each question so they are independent.” Following and relating information across a sequence of questions is something we take for granted. But for the AI, you have to spell it out every time.

All the limitations imposed on the testing system highlight the limits of current AI models. But for a student who already has the knowledge required to pass the 8th-grade science exam, interpreting diagrams, giving written answers to questions, and following related questions would be trivial.

We see the same kinds of limitations in other domains where AI is set to solve problems that usually require the general and abstract problem-solving capabilities of the human brain. For example, earlier this year, DeepMind created a very interesting AI system that could play the real-time strategy game StarCraft II at championship level. But the AI could only play one of three races and on a limited number of maps. The slightest change to the conditions would significantly degrade its performance.

How does Aristo answer science questions?

Contrary to what some reporters have alluded to, Aristo is not reasoning like a high-school student. But it is nonetheless an interesting composition of different AI techniques, including deep learning. Collectively, the different parts that constitute Aristo move the needle of AI-powered question-answering by a good notch.

Aristo also has a large corpus of knowledge made up of 10 large datasets, spanning approximately 300 gigabytes of data, which include articles about scientific topics and structured knowledge graphs about the relations and definitions of different objects and concepts.

The AI model is composed of various search, reasoning and large-scale language models, each of which evaluates the different parts of a question and searches for the option that best answers it. The search models use lookup methods to probe Aristo’s corpus for explicit mentions or relations between the answers and the different points raised in the questions.
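The lookup idea can be illustrated with a minimal Python sketch. This is not Aristo’s code: the three-sentence corpus, the stopword list and the names `tokenize`, `lookup_score` and `answer` are all assumptions made up for illustration, and plain word overlap stands in for whatever retrieval methods the real search models use.

```python
import re

# Hypothetical three-sentence corpus (an assumption for this sketch).
CORPUS = [
    "Water is the only common substance that is naturally found as a solid, liquid or gas.",
    "Iron is a metal that is attracted to magnets.",
    "The moon reflects light from the sun.",
]

STOPWORDS = {"a", "an", "and", "as", "in", "is", "of", "or", "the", "that", "to", "which"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lookup_score(question, option, corpus=CORPUS):
    """Score an option by its best word overlap with any single corpus sentence."""
    query = tokenize(question) | tokenize(option)
    option_words = tokenize(option)
    best = 0
    for sentence in corpus:
        words = tokenize(sentence)
        # Only sentences that mention the option can support it.
        if option_words & words:
            best = max(best, len(query & words))
    return best


def answer(question, options, corpus=CORPUS):
    """Pick the option with the highest lookup score."""
    return max(options, key=lambda option: lookup_score(question, option, corpus))


question = "Which substance is usually found in nature as a liquid, solid, and gas?"
print(answer(question, ["water", "iron", "moon", "sand"]))  # water
```

Note that the function never produces an answer of its own. It can only rank the options it is handed, which is why the multiple-choice format matters so much.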

The reasoning part of the AI model uses tuples, short sentences that define relations between different objects (e.g. moon reflects light), to determine which of the answers have stronger ties to the different requirements of the question.
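A toy version of tuple-based scoring might look like the following. The tuples, the scoring rule and the names `TUPLES`, `words` and `tuple_score` are hypothetical and only meant to show the general shape of the idea; the solver described in the paper is considerably more sophisticated.

```python
import re

# Hypothetical knowledge tuples of the form (subject, relation, object).
TUPLES = [
    ("moon", "reflects", "light"),
    ("sun", "emits", "light"),
    ("magnet", "attracts", "iron"),
    ("filter paper", "separates", "solids from liquids"),
]


def words(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tuple_score(question, option, tuples=TUPLES):
    """Add up the question-word overlap of every tuple that mentions the option."""
    question_words = words(question)
    option_words = words(option)
    score = 0
    for knowledge_tuple in tuples:
        tuple_words = words(" ".join(knowledge_tuple))
        if option_words <= tuple_words:
            # Count question words tied to the option through this tuple.
            score += len(question_words & (tuple_words - option_words))
    return score


question = "Which equipment will best separate a mixture of iron filings and black pepper?"
options = ["magnet", "filter paper", "triple-beam balance", "voltmeter"]
print(max(options, key=lambda option: tuple_score(question, option)))  # magnet
```

Here “magnet” wins only because one tuple links it to the word “iron” in the question, which is exactly the kind of association the critics quoted later in this article are worried about.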

Both the reasoning and search modules won’t work if you remove the multiple-choice answers, because they can’t compose their own answers from their knowledge corpus.

An example of Aristo’s Tuple Inference Solver (source: Arxiv.org)

The most important components of Aristo, however, are its large-scale language models. Aristo uses BERT, an AI language model developed by Google in 2018, and RoBERTa, a variation of the same model, developed by Facebook in 2019. Both are deep learning models trained on very large corpora of text.

According to the authors, “We apply BERT to multiple choice questions by treating the task as classification.” Interestingly, classification is exactly what deep learning is very good at. So, the creators of Aristo have been able to cleverly arrange the problem to fit the format and strengths of current AI technology. But on the other hand, classification wouldn’t be possible if they didn’t already have multiple possible answers to the question, which further highlights the lack of reasoning and the rigidness of the model.
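To make the “classification” framing concrete, here is a small sketch in Python. It assumes a scoring function that returns one raw score per (question, option) pair; in Aristo that role is played by a fine-tuned language model, while here `toy_score` and its numbers are invented stand-ins, so only the softmax-and-pick-the-best step should be taken at face value.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def classify(question, options, score_pair):
    """Score each (question, option) pair and return (best option, probabilities)."""
    logits = [score_pair(question, option) for option in options]
    probs = softmax(logits)
    best = max(range(len(options)), key=probs.__getitem__)
    return options[best], probs


# Stand-in for a fine-tuned language model: hand-picked, made-up scores.
TOY_LOGITS = {
    "magnet": 4.0,
    "filter paper": 1.5,
    "triple-beam balance": 0.5,
    "voltmeter": 0.2,
}


def toy_score(question, option):
    """Hypothetical scorer that ignores the question and looks up a fixed score."""
    return TOY_LOGITS[option]


question = "Which equipment will best separate a mixture of iron filings and black pepper?"
choice, probs = classify(question, list(TOY_LOGITS), toy_score)
print(choice)  # magnet
```

The classifier has nothing to work with unless a list of options is supplied, which is the rigidity described above.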

AI2’s engineers have created their own versions of the two language models, named AristoBERT and AristoRoBERTa, which they’ve fine-tuned with science training sets. This makes the AI models much more specialized for science tasks (and less attuned to more general language tasks). This is also in line with the properties of current AI technology: The narrower the domain, the better they perform.

Although Aristo’s final answer comes from combining the output of its different components, it owes most of its success to the large-scale language models.
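One simple way to picture the combination step is a weighted sum of each component’s confidence in every option. This article does not spell out how Aristo merges its solvers, so the weighted sum, the weights and the per-solver scores below are all assumptions chosen for illustration.

```python
def combine(solver_scores, weights):
    """Return the weighted sum of per-solver scores for each option."""
    options = next(iter(solver_scores.values())).keys()
    return {
        option: sum(weights[name] * scores[option] for name, scores in solver_scores.items())
        for option in options
    }


# Made-up confidences from three hypothetical solvers.
solver_scores = {
    "search": {"magnet": 0.4, "filter paper": 0.5},
    "tuple": {"magnet": 0.5, "filter paper": 0.3},
    "language_model": {"magnet": 0.9, "filter paper": 0.05},
}
# Made-up weights that favor the language model.
weights = {"search": 0.2, "tuple": 0.2, "language_model": 0.6}

combined = combine(solver_scores, weights)
print(max(combined, key=combined.get))  # magnet
```

In this toy setup the search solver prefers the wrong option, but the heavily weighted language model pulls the final answer back to “magnet.”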

Aristo was able to achieve remarkable improvement thanks to deep learning–based language models. (source: Arxiv.org)

Does Aristo compare to human intelligence?

Contrary to the hype that tech news outlets have created around Aristo, its creators don’t have any illusions that their AI model rivals the intelligence of human beings. They make it clear at the beginning of the paper that Aristo is “not a full solution to general question-answering.”

The authors also show in their paper that special modifications to the questions can cause the AI model to fail in unexpected ways, a phenomenon known as adversarial examples. For instance, Aristo might choose the right answer to a question with four possible answers, but the wrong one if the number of answers is increased and several unrelated options are added.

“These results show that while Aristo performs well, it still has some blind spots that can be artificially uncovered through adversarial methods,” the authors note.

It’s also worth verifying how Aristo performs against other scientific challenges. Ernest Davis, professor at New York University, suggests testing the AI’s performance in answering questions that are easy to answer for humans but hard for computers.

For instance, it would be interesting to see how Aristo handles the notion of changes to quantity by asking it a simple question: “In 1990 there were 500 squirrels in the park; in 2019 there are 1,000. The number A) increased B) decreased”

“The current neural network approaches will find it difficult to determine which combinations of ‘later’, ‘earlier’, ‘more’, and ‘less’ constitute ‘increase’ and which constitute ‘decrease,’” Davis says. “Neural networks have no inherent idea of magnitude or of time.”

There are other ways Aristo’s “knowledge” of different concepts can be put to test. For instance, consider the following question mentioned in the paper: “Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter?” Aristo correctly chooses magnet as the answer.

Does this mean that Aristo knows the properties of magnets and iron filings and what it means to separate them, or is it only choosing the answer because of strong statistical associations between the two words “magnet” and “iron”? To put this to the test, Davis suggests posing the simple question: “You have a mixture of iron filings and iron nails. Can you use a magnet to separate them? A) Yes B) No”

“My feeling is that to a large extent, this is a Clever Hans phenomenon,” Davis says, referring to the famous horse that was ostensibly able to perform arithmetic. “That is to say: It’s impressive in its way. Clever Hans, the horse, was also impressive in his way; he was an extraordinarily sensitive reader of body language. But he didn’t actually know arithmetic; and there’s no reason to think that Aristo actually understands science.”

Renowned software engineers have also questioned claims that AI understands science in the same way humans do.

No, not at all.

That AI has no sense of understanding. https://t.co/E7MnKCin7i

— Grady Booch (@Grady_Booch) September 5, 2019

Aristo is neither smarter than nor as smart as an 8th grader. Neither is it a breakthrough technology; it’s more of an incremental improvement over previous work, albeit a remarkable one.

The authors describe Aristo as “only a step on the long road toward a machine that has a deep understanding of science.” They also acknowledge that an AI that understands science should be able to generate short and long answers to direct questions, design experiments to test hypotheses and learn from other experts.

“These are all ambitious tasks still largely beyond the current technology, but with the rapid progress happening in NLP and AI, solutions may arrive sooner than we expect,” Aristo’s creators conclude.
