Blog

Artificial intelligence: Does another huge language model prove anything?

February 3, 2020

Robot arm writing — Image credit: Depositphotos

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

This week, Google introduced Meena, a chatbot that can “chat about… anything.” Meena is the latest of many efforts by large tech companies trying to solve one of the toughest challenges of artificial intelligence: language.

“Current open-domain chatbots have a critical flaw — they often don’t make sense. They sometimes say things that are inconsistent with what has been said so far, or lack common sense and basic knowledge about the world,” Google’s researcher wrote in a blog post.

They’re right. Making sense of language and engaging in conversations is one of the most complicated functions of the human brain. Until now, most efforts to create AI that can understand language, engage in meaningful conversations, and generate coherent excerpts of text have yielded poor-to-modest results.

In the past years, chatbots have found a niche in some domains such as banking and news. Advances in natural language processing have also paved the way for the widespread use of AI assistants such as Alexa, Siri, and Cortana. But so far, current AI can only engage in language-related tasks as long as the problem domain remains narrow and limited, such as answering simple queries with clear meanings or carrying out simple commands.

Advanced language models such as OpenAI’s GPT-2 can produce remarkable text excerpts, but those excerpts quickly lose their coherence as they grow in length. As for open-domain chatbots, AI agents that are supposed to discuss a wide range of topics, they either fail at generating relevant results or often provide vague answers that can be given to various questions, like a politician evading giving specific answers at a press conference.

Now, the question is, how much does Meena, Google’s massive chatbot, move the needle in conversational AI?

What’s under the hood?

Like the many innovative language models that have been introduced in the past few years, Google’s Meena has a few interesting details. According to the blog post and the paper published in the arXiv preprint server, Meena is based on the Evolved Transformer architecture.

The Transformer, introduced for the first time in 2017, is a sequence-to-sequence (seq2seq) machine learning model, which means it takes a sequence of data as input (numbers, letters, words, pixels…) and outputs another sequence. Sequence to sequence models are especially good for language-related tasks such as translation and question-answering.

There are other types of seq2seq models such as LSTM (long short-term memory) and GRU (gated recurrent unit) networks. Because of their efficiency in parallel processing and their ability to train on much more data, Transformers have been rising in popularity and have become the basic building block of most cutting-edge language models in the past couple of years (e.g. BERT, GPT-2).

The Evolved Transformer is a specialized type of the AI model that uses algorithmic search to find the best network design for the Transformer. One of the key challenges of developing neural networks is finding the right hyperparameters. The Evolved transformer automates the task of finding those parameters.

A bigger and costlier AI model

Like many other recent advances in AI, Meena owes at least part of its success to its huge size. “The Meena model has 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversations,” Google’s AI researchers write. In comparison, OpenAI’s GPT-2 had 1.5 billion parameters and was trained on a 40-gigabyte corpus of text.

To be clear, we’re still far from creating AI models that match the complexity of the human brain, which has approximately 100 billion neurons (the rough equivalent of parameters in artificial neural networks) and more than 100 trillion synapses (connections between neurons). So, size does matter. But it’s not everything. For one thing, no human can process 340 GB of text data in their lifetime, let alone needing as much to be able to conduct coherent conversations.

And the obsession with creating bigger networks and throwing more compute and more data at the problem causes problems that are often overlooked. Among them is the cost and carbon footprint of developing such models.

According to the paper, the training of Meena took 30 days on a TPU v3 Pod, composed of 2,048 TPU cores. Google doesn’t have a price listing for the 2,048-core TPU v3 Pod, but a 32-core configuration costs $32 per hour. Projecting that to 2,048 cores ($2,048/hour), it would cost $49,152 per day and $1,474,560 for 30 days. While it’s interesting that Google can allocate such resources to researching bigger AI models, most academic research labs don’t have those kinds of funds to spare. These costs make it difficult to develop such AI models outside of the commercial sector.

More sensible and specific

Benchmarks play a very important role in ranking AI models and evaluating their accuracy and effectiveness. But as we’ve seen in these pages, most AI benchmarks can be gamed and provide misleading results.

To test Meena, Google’s engineers developed a new benchmark, the Sensible and Specificity Average (SSA). Sensibility means that the chatbot must make sense when it’s engaging in conversation with a human. So, if the AI produces an answer that in no way applies to the question, it will score negative on sensibility.

But providing a coherent answer is not enough. Some responses like “Nice!” or “I don’t know” or “Let me think about it” can be applied to many different questions without the AI necessarily understanding their meaning. This is where specificity comes into play. In addition to evaluating the sensibility of the AI, the reviewers also specify whether the agent generated a response that is relevant to the topic of their conversation.

In comparison to other popular chatbot engines, Meena scored much better on SSA.

google meena chatbot ssa score — The Sensibility and Specificity Average (SSA) benchmark determines the coherence and relevance of answers produced by AI chatbots.

This is a remarkable improvement. The paper contains several conversation examples that show Meena stays on the subject throughout multiple exchanges. It even comes up with witty responses in some cases, such as this joke that is mentioned both in the paper and the blog post.

google meena chatbot sample conversation — Google’s AI chatbot Meena sometimes engages in interesting conversations.

But Google has not yet released the model or a demo, so we only have the conversations released in the paper as guidance for how good the AI is. “We are evaluating the risks and benefits associated with externalizing the model checkpoint… and may choose to make it available in the coming months to help advance research in this area,” the researchers write.

Are we closer to AI that understands language?

While Meena shows the remarkable improvements of NLP research, has it gotten us closer to AI that has “common sense and basic knowledge about the world”?

“Many of the gains we are seeing are akin to an advanced form of memorization rather than intelligent behavior,” says AI researcher Stephen Merity. “This type of approach will work well for background knowledge. Common sense and abstraction are far less likely. We don’t yet know how well these models can perform complex reasoning so it’s likely that if you do see common sense, it’ll again be through a fuzzy form of memorization.”

“There’s no demo available, and until there is, I don’t think we should take it seriously,” says cognitive scientist Gary Marcus. “We’ve seen this movie before; GPT was originally dubbed ‘too dangerous to release’; when it came out, it was amusing, and maybe even impressive, but ultimately quite superficial.”

Marcus recently published his observations on evaluating GPT-2.

“I suspect that essentially the same problems will apply: reasoning will be poor, and the system won’t develop cognitive models that are rich enough for it to truly understand what’s going on. It might pass the Turing Test, but it won’t be genuinely intelligent,” Marcus says.

ZDNet’s Tiernan Ray also makes an interesting observation on Meena’s conversations: “Meena is recreating a distribution of language that is startlingly accurate, but also merely a recreation of information, which is boring. Patterns of language in the Meena form are highly associative at a word level, which makes Meena’s best examples those of wordplay. Wordplay is interesting up to a point, and it feels like it reflects intelligence, in some fashion. But it also quickly becomes superficial and tiresome.”

While it’s cool for AI chatbots to come up with conversations like the following, also drawn from Meena’s paper, they won’t serve any purpose.

Clearly, Meena—or any other AI agent for that matter—doesn’t have the same understanding of the world as we have. I don’t expect an AI that can, in a second, labor through more text data than I have read in my entire life to understand what it means like to take the weekend off after five days of hard work. Neither do I think it knows what it means to “love Friday.” As long as AI agents don’t experience life as we do, any semblance to humans in their behavior will at best be a cheap imitation.

On a more serious note, however, I’m looking forward to advances in AI providing us with more utilitarian chatbots, AI agents that can be more flexible in solving specific problems, getting information from the web, extracting implied meanings from the text, such as the conversation that Ray provides in his article.

tiernan ray zdnet chatbot conversation — A much more useful conversation with a chatbot. AI is not ready for this yet.

I’d doubt that Meena is ready to take on such a task. Google has given us another huge language model, but we have still a long way to go before we can claim that AI has come close to understanding human language.

“I think that these types of models will be quite important but that no matter how much we scale them we’re still missing components,” Merity says.

1 COMMENT

Oleg Alexandrov June 22, 2020 at 5:07 am

That Meena scored 79% on the SSA benchmark, while humans score 86%, is certainly impressive. We have seen very nice progress in natural language processing within the last decade. It apparently came about without theoretical breakthroughs, rather from mixing, matching, and refinement of exiting techniques, improvements in efficiency, modeling, and large volumes of data.

Maybe the next advances will come from the machine gradually building up a model of what the topic at hand is, as the interaction with the user adds more information. Or the machine may have a set (maybe a large one) of potential topics and would gravitate to one of them based on how the discussion flows. That way it have a certain coherency and focus, rather than like now, where the conversation just drifts into nonsense.

Either way, my hope is that since we got that far surely the gusher of ideas and the progress momentum won’t just stall at the current results.

Loading...

Moving beyond passive RAG: How to implement active memory reconstruction for…

How self-improving harnesses are rewriting the agent engineering playbook

How Nvidia’s ASPIRE framework accelerates robot programming with self-improving AI

How the AI arms race moved from smart models to full-stack…

Why LLMs should stop thinking out loud (and what comes after…

Applied ML: When ‘perfect’ becomes the enemy of ‘good’

AI can’t replace software engineers yet, but here is how to…

How to turbocharge your product and market research with DeepSearch

How looking differently at data can save your machine learning project

Building a solid data foundation for generative AI applications

Demystifying loop engineering: Get more from AI agents, avoid loopmaxxing

Why the future of agentic AI is all about the harness

The evolution of LLM tool-use from API calls to agentic applications

What makes DeepSeek-V3.2 so efficient?

What to know about Claude Opus 4.5

AI is writing your code, but who’s reviewing it?

Machine learning in space: Building intelligent systems for the harshest environments

Decoding the brain, inspiring AI: How Rahul Biswas is bridging neuroscience…

The cash flow conundrum: How technology is reshaping small business finance

What to know about the security of open-source machine learning models

Artificial intelligence: Does another huge language model prove anything?

What’s under the hood?

A bigger and costlier AI model

More sensible and specific

Are we closer to AI that understands language?

Like this:

1 COMMENT

Leave a ReplyCancel reply

What’s under the hood?

A bigger and costlier AI model

More sensible and specific

Are we closer to AI that understands language?

Like this:

1 COMMENT

Leave a ReplyCancel reply

Discover more from TechTalks