Can you trust ChatGPT and other LLMs in math?

Image source: 123RF

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

ChatGPT and other large language models (LLMs) have proven to be useful for tasks other than generating text. However, in some fields, their performance is confusing. One such area is math, where LLMs can sometimes provide correct solutions to difficult problems while at the same time failing at trivial ones.

A growing body of research explores the capabilities and limits of LLMs in mathematics. A recent study by researchers at several universities found that ChatGPT performs below the level of an average mathematics graduate student. And a separate study by NYU professor Ernest Davis found that LLMs fail on very simple mathematical problems posed in natural language.

These studies help us better understand the problem-solving gap between humans and deep neural networks. Humans and LLMs might reach the same result on certain mathematical problems, but their methods differ. And this can prove to be critical when it comes to entrusting LLMs (and other deep learning models) with tasks that require planning and reasoning.

ChatGPT struggles with advanced math

Mathematics underpins many quantitative domains of knowledge, such as engineering and the social sciences. As a result, non-mathematicians working in these domains might turn to ChatGPT to answer mathematical questions.

“Because ChatGPT always phrases its answers with a high degree of confidence, this group of people might have difficulties telling correct mathematics apart from incorrect mathematical reasoning, which might lead to a bad decision being taken further down the line, since they rely on faulty mathematics,” Simon Frieder, machine learning researcher at Oxford University, said to TechTalks. “Therefore, it is important to inform these groups of the limits of ChatGPT, so that no undue confidence is placed on the usage of ChatGPT.”

Frieder is the co-author of a recent paper that explores ChatGPT’s capacity to emulate the skills required for professional mathematics. The authors have assembled a dataset called GHOSTS, composed of problems in a range of areas including answering computational questions, completing mathematical proofs, solving problems posed in mathematical Olympiads, and searching through math literature.

The problems were pulled from several sources, including graduate-level textbooks, other mathematical datasets, and knowledge corpora.

Their findings show that ChatGPT performs below a passing grade on most tasks. In some tasks, it makes progress up to a point. For example, on grad-text questions, the researchers note that ChatGPT “never failed to understand a query” (some may argue whether “understand” is the correct term to use for an LLM) but produces faulty answers. Frieder said that ChatGPT fails particularly egregiously on problems that “require ingenious proofs” such as questions from Olympiads.

And it also showed very poor numerical abilities. “Even computing simple integrals (antiderivatives) can be difficult, since ChatGPT often gets the general structure of the antiderivative right, but misses the precise constant,” Frieder said.
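One practical consequence is that an antiderivative suggested by a chatbot can be checked mechanically: differentiate it and compare the result with the integrand. The sketch below does this numerically with a central finite difference. The integrand x·cos(2x) and the two candidate answers are hypothetical examples chosen for illustration, not cases taken from the paper, and the function names, step size, and tolerance are assumptions.

```python
import math


def derivative(func, x, h=1e-5):
    """Approximate func'(x) with a central finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


def is_antiderivative(candidate, integrand, points, tol=1e-6):
    """Return True if candidate'(x) matches integrand(x) at every sample point."""
    return all(
        abs(derivative(candidate, x) - integrand(x)) <= tol
        for x in points
    )


def integrand(x):
    # Hypothetical example integrand: x * cos(2x).
    return x * math.cos(2 * x)


def correct(x):
    # x*sin(2x)/2 + cos(2x)/4 differentiates back to x*cos(2x).
    return x * math.sin(2 * x) / 2 + math.cos(2 * x) / 4


def wrong_constant(x):
    # Same structure, but the second coefficient is 1/2 instead of 1/4.
    return x * math.sin(2 * x) / 2 + math.cos(2 * x) / 2


SAMPLE_POINTS = [0.3, 1.0, 2.5]

if __name__ == "__main__":
    print(is_antiderivative(correct, integrand, SAMPLE_POINTS))
    print(is_antiderivative(wrong_constant, integrand, SAMPLE_POINTS))
```

A numerical check like this can only reject a wrong answer at the sampled points. It does not prove that a candidate is correct everywhere, so it is a sanity check rather than a replacement for a symbolic verification.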

One of the interesting findings of the paper is the mismatch between different levels of problem-solving. For example, in holes-in-proofs problems, the researchers found that ChatGPT often “executes complicated symbolic tasks with ease” but on many occasions, it “fails on basic arithmetic or basic rearranging.”

ChatGPT math skills

“There is no domain of mathematics on which you can fully trust ChatGPT’s output (this is also true beyond mathematics, due to the stochastic nature of language models),” Frieder said. “This is a general limitation of language models, that there are no rigorous approaches to guarantee correctness of output, of which ChatGPT also suffers.”

One of the problems with ChatGPT is its high confidence and authoritative voice, even when it fails utterly and uses faulty logic. This makes it very difficult to use it as a reliable source of mathematical knowledge.

“It seems fair to say that ChatGPT is inconsistently bad at advanced mathematics: While its ratings drop with the mathematical difficulty of a prompt, it does give insightful proofs in a few cases,” the researchers write.

Language models struggle at basic math

In another recent paper, Ernest Davis explores the capacities and limitations of LLMs in solving problems such as the following: “George has seven pennies, a dime, and three quarters. Harriet has four pennies and four quarters. First, George gives Harriet thirty-one cents in exact change; then Harriet gives him back exactly half of her pennies. How much money does George now have?”

These are very simple problems with three characteristics: They require elementary math skills, they are posed in natural language, and they involve commonsense world knowledge.

Davis tests three different approaches. In the first, the LLM is asked to output the answer directly. In the second, the model outputs a computer program that solves the problem. In the third, it outputs a formalized representation that can be fed to an automated theorem verifier. LLMs perform poorly with all three approaches, though they do slightly better when asked to generate a program that solves the problem.
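To make the second approach concrete, here is a minimal sketch of the kind of program that would solve the George and Harriet problem quoted above. It is an illustration written for this article, not output from any model tested in the paper, and the function names are assumptions. The commonsense step is that “exact change” means George must hand over coins he actually holds, so the program searches his purse for a combination worth 31 cents.

```python
from itertools import product

# Coin values in cents.
VALUES = {"penny": 1, "dime": 10, "quarter": 25}


def total(purse):
    """Return the value of a purse (coin name -> count) in cents."""
    return sum(VALUES[coin] * count for coin, count in purse.items())


def exact_change(purse, amount):
    """Return every sub-collection of `purse` worth exactly `amount` cents."""
    coins = list(purse)
    ranges = [range(purse[coin] + 1) for coin in coins]
    options = []
    for counts in product(*ranges):
        handful = dict(zip(coins, counts))
        if total(handful) == amount:
            options.append(handful)
    return options


def solve():
    george = {"penny": 7, "dime": 1, "quarter": 3}
    harriet = {"penny": 4, "dime": 0, "quarter": 4}

    # Step 1: George pays 31 cents in exact change.
    options = exact_change(george, 31)
    assert len(options) == 1, "the payment should be unambiguous"
    for coin, count in options[0].items():
        george[coin] -= count
        harriet[coin] += count

    # Step 2: Harriet returns exactly half of her pennies.
    assert harriet["penny"] % 2 == 0
    returned = harriet["penny"] // 2
    harriet["penny"] -= returned
    george["penny"] += returned

    return total(george)


if __name__ == "__main__":
    print(solve())  # 66
```

The only way to make 31 cents from George's coins is one quarter and six pennies, which leaves Harriet with ten pennies. She returns five of them, and George ends up with 66 cents.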

So, if LLMs can’t perform simple elementary math, why are they showing remarkable performance on benchmarks designed for evaluating mathematical skills in AI?

LLMs have made remarkable progress in linguistics and, to a lesser degree, in commonsense and math. But, Davis points out in his paper, LLMs are poor at combining several skills. For example, LLMs in general “do much worse on problems that involve two arithmetic operations than those that require one, both in word problems and in purely mathematical problems.”

One of the important points Davis makes is that the math benchmarks developed to evaluate AI systems are taken from tests developed for humans. This can result in misleading conclusions. In humans, basic skills become the foundation on which we build more advanced skills. So, for example, if a person can solve differential equations, you would expect them to be proficient in linear algebra.

But machine learning models such as LLMs can sometimes find the answers to complicated problems without acquiring the same skills as humans. This can happen because the answers are directly in their training data (especially when a model is trained on a very large corpus). It can also happen because the model discovers patterns that lead to the right answer most of the time but not always.

“[LLMs] can’t address any areas of math reliably or any advanced area of math close to reliably,” Davis told TechTalks. At the same time, “There is probably no area of math where you will never get a correct answer, because it will sometimes simply regurgitate answers that it has seen in its training set.”

Interestingly, Davis finds some of the same kinds of inconsistencies found in the other paper. For example, while ChatGPT can solve basic and intermediate math problems frequently, it occasionally fails at very simple tasks, such as counting.

“My own feeling is that, if an LLM can’t solve simple arithmetic problems then there is really not much point in asking how well it does on problems in measure theory, topology, or abstract linear algebra,” Davis said.

Searching mathematical knowledge

In the tests that Frieder and his colleagues carried out, two areas where ChatGPT performed especially well were definition retrieval and reverse definition retrieval. Basically, it can serve as a very good search engine for mathematical knowledge, providing descriptions of topics and mapping descriptions back to the main concept.

“[ChatGPT] performs particularly well as a mathematical knowledge base: It can be used to emulate a mathematical search engine and to retrieve various facts about more advanced objects,” Frieder said. “For example, I myself learned of a certain instance of a more general object, called weak-* topology, that is called the ‘vague topology’. At first, the word choice seemed odd and I thought ChatGPT was hallucinating, until I searched the internet and found out, via Wikipedia, that the ‘vague topology’ was the correct name for a specific mathematical object.”

Davis agrees that ChatGPT and other LLMs can be good at text matching, which has also been observed in other areas such as standardized science tests. But he also pointed out that it remains to be seen how they compare in practice against the likes of Wikipedia and MathWorld in definition retrieval and reverse definition retrieval.

“For inverse definitional retrieval, one would want to check how robust their system is on rewordings of the definition from a standard textbook one, both in ways that preserve meaning, where it should certainly succeed, and in ways that change the meaning, where it should presumably not succeed, or ideally give some kind of a warning,” Davis said.

Fine-tuned LLMs

In their paper, Frieder and his colleagues find that deep learning models that have been designed for math problems or LLMs fine-tuned on math datasets are much more accurate than ChatGPT.

One work that stands out is a deep learning model designed by Facebook AI researchers Guillaume Lample and Francois Charton in 2019. They created a system to generate training datasets for supervised learning of integration and first- and second-order differential equations. They then trained a transformer model on the generated data. Their findings show that the model outperforms rule-based algebra programs such as MATLAB and Mathematica on these tasks.
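The article does not go into how such a training set can be produced. One strategy that suits integration is to work backward: build a random expression, differentiate it symbolically, and store the pair (derivative, original expression) as (problem, solution). The sketch below assumes this strategy for illustration and is not the authors' implementation. The tuple-based expression format, the operator set, and all function names are hypothetical.

```python
import random

# Expressions are nested tuples: ("x",), ("const", c), ("add", a, b),
# ("mul", a, b), ("sin", a), ("cos", a).


def random_expr(rng, depth):
    """Build a random expression tree with at most `depth` nested operators."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.7 else ("const", rng.randint(1, 5))
    op = rng.choice(["add", "mul", "sin", "cos"])
    if op in ("add", "mul"):
        return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))
    return (op, random_expr(rng, depth - 1))


def diff(expr):
    """Differentiate an expression tree with respect to x (no simplification)."""
    op = expr[0]
    if op == "x":
        return ("const", 1)
    if op == "const":
        return ("const", 0)
    if op == "add":
        return ("add", diff(expr[1]), diff(expr[2]))
    if op == "mul":  # product rule
        return ("add",
                ("mul", diff(expr[1]), expr[2]),
                ("mul", expr[1], diff(expr[2])))
    if op == "sin":  # chain rule
        return ("mul", ("cos", expr[1]), diff(expr[1]))
    if op == "cos":  # chain rule
        return ("mul", ("mul", ("const", -1), ("sin", expr[1])), diff(expr[1]))
    raise ValueError(f"unknown operator: {op}")


def show(expr):
    """Render an expression tree as a fully parenthesized string."""
    op = expr[0]
    if op == "x":
        return "x"
    if op == "const":
        return str(expr[1])
    if op == "add":
        return f"({show(expr[1])} + {show(expr[2])})"
    if op == "mul":
        return f"({show(expr[1])} * {show(expr[2])})"
    return f"{op}({show(expr[1])})"


def make_pairs(n, seed=0, depth=3):
    """Return n (integrand, antiderivative) string pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        antiderivative = random_expr(rng, depth)
        integrand = diff(antiderivative)
        pairs.append((show(integrand), show(antiderivative)))
    return pairs


if __name__ == "__main__":
    for problem, solution in make_pairs(3):
        print(problem, "->", solution)
```

The appeal of working backward is that differentiation is mechanical while integration is not, so correct (problem, solution) pairs can be produced without an integrator. The derivatives here are left unsimplified, so a realistic pipeline would also need a simplification step.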

A second model is Minerva, a large language model by Google pre-trained on general natural language data and fine-tuned on data for mathematics word problems, competition mathematics evaluations, and problems in science and engineering. As expected, both models outperformed ChatGPT on math problems.

“Fine-tuning on larger mathematical dataset is likely to increase performance,” Frieder said.

Davis also agreed that fine-tuned transformers like the one developed by Lample and Charton can “occasionally find solutions to problems that standard systems like Mathematica miss.” But he noted that “the kinds of problems that these systems have been able to solve would almost never be of actual mathematical interest.”

“That could change,” he added.

Davis also pointed out that there have been a few actual cases where AI systems have been useful to mathematicians, including a deep learning system that discovered fast matrix multiplication algorithms and a few systems that were able to help in graph theory. “But this kind of assistance is very much hit-and-miss,” he said.

More work to be done

There is still much room for discovery in the field. With better fine-tuning, these systems might one day be reliable assistants for people without higher degrees in mathematics. But currently, the cost of evaluating the mathematical performance of LLMs is prohibitive.

Frieder and his colleagues have open-sourced their dataset and framework, which includes a detailed system for grading the model’s output and confidence, error and warning codes, and commentary by the reviewers. The data assembly and review process required manual effort by experts and could not be outsourced through platforms such as Amazon Mechanical Turk. By making the dataset publicly available, they hope to “encourage the community to contribute and grow these datasets, so that they can be used as a useful benchmark for other LLMs,” the researchers write.

Frieder and his coauthors are also working on an upcoming paper that will investigate whether it is possible to do automated mathematics purely with LLMs. “This will give an empirical answer to an old question of whether mathematics needs to be formalized or not, in order to do automated theorem proving,” he said.


  1. Can an LLM generate a procedure for solving a mathematical problem when that procedure is logically derivable from the mathematics it has been trained on but was not contained in the training corpus and could not be generated using statistical probabilities between elements in the corpus (i.e. the way it generates language)? Do deep-learning devices trained primarily on mathematics do this?

  2. [Mathematics under the scrutiny of ChatGPT and other LLMs] This article sheds light on the capabilities and limitations of language models such as ChatGPT in solving mathematical problems. While these models have proven useful in several domains, including math, their performance in this field is often below that of an average graduate student. However, studies like these help us understand the problem-solving gap between humans and deep neural networks, highlighting the importance of informing non-mathematicians using ChatGPT of its limitations.
