LLMs battle it out in Street Fighter—here’s what it means for real applications

[Image: robots fighting, generated with Bing Image Creator]

Developers and researchers are finding ways to extend the capabilities of large language models (LLMs) beyond text-based applications. One interesting niche that is getting attention is video games, where an LLM is given instructions for playing the game and left to figure out how to achieve goals and maximize its score.

Banjo Obayomi, Senior Developer Advocate at Amazon Web Services (AWS), recently created a project in which he pitted various frontier LLMs against each other in Street Fighter III.

While seeing LLMs play games is entertaining, the findings can have important implications for real-world applications that require real-time responses and dynamic decision-making.

“As a builder and gamer, I’ve always been fascinated with AI agents playing video games,” Obayomi told TechTalks.

The classic technique for creating game-playing AI systems is to use reinforcement learning (RL). In such settings, an RL agent must play millions of rounds to learn policies that can maximize a reward in the game environment. RL-based systems have made great breakthroughs in the past decade. However, training reinforcement learning agents is very complicated and requires a lot of time and expensive compute resources.

“With LLMs the model has already gone through a pre-training phase and can be used right away just by using a prompt,” Obayomi said.

Obayomi was inspired by LLM Colosseum, an open source project that pits OpenAI and Mistral models against each other in Street Fighter III. The project uses Diambra, an emulator for creating AI agents that play different games.

Each model is given a description of the game, its current state, its own previous moves, and its opponent’s moves. It must choose its next move based on this information. The actions are then sent to the emulator, and the results are sent back to the models to choose their next moves. What’s interesting about this approach is that the model is not trained on any previous game data and uses pure in-context learning to choose its actions.

Obayomi used Amazon’s Bedrock platform, which provides serverless access to a wide range of LLMs, including models from Anthropic, AI21 Labs, Cohere, and Mistral (but not OpenAI). 
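To make the decision loop concrete, here is a minimal sketch of what a single turn might look like against a Claude model on Bedrock. It is an illustration under assumptions, not the project’s actual code: the move list, prompt wording, model ID, and fallback behavior are all placeholders.

```python
import json
import boto3

# Hypothetical move vocabulary -- the real project maps to Diambra's action space.
MOVES = ["Move Closer", "Move Away", "Low Punch", "High Punch", "Low Kick", "Hadoken"]

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def choose_move(game_state: dict, history: list[str], opponent_moves: list[str]) -> str:
    """Ask the LLM for its next move using only in-context information."""
    prompt = (
        "You are playing Street Fighter III. Pick your next move.\n"
        f"Valid moves: {', '.join(MOVES)}\n"
        f"Current state: {json.dumps(game_state)}\n"
        f"Your previous moves: {history[-5:]}\n"
        f"Opponent's recent moves: {opponent_moves[-5:]}\n"
        "Reply with exactly one move from the list."
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 20,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        body=json.dumps(body),
    )
    reply = json.loads(response["body"].read())["content"][0]["text"].strip()
    # Fall back to a safe default if the model hallucinates a move that doesn't exist.
    return reply if reply in MOVES else "Move Closer"
```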

Obayomi ran 314 matches with 14 models. Interestingly, the highest Elo ranking belonged to Haiku, the smallest and fastest version of Anthropic’s Claude 3 family. “The smaller models outperformed larger models in the arena likely due to their lower latency which allowed for quicker reaction times and more moves per match,” Obayomi writes in his blog post.

The original LLM Colosseum, created by Quivr CEO Stan Girard, gave GPT-3.5 Turbo the highest Elo rating, possibly also due to its very fast inference speed.
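For readers unfamiliar with how such leaderboards are built, Elo ratings are updated after each match based on how surprising the result was: beating a higher-rated model earns more points than beating a lower-rated one. Below is a standard Elo update in Python; the K-factor of 32 and the 1500 starting rating are conventional defaults, not values reported for this experiment.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one match."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: an underdog at 1500 beats a favorite at 1600 and gains about 20 points.
print(update_elo(1500, 1600, a_won=True))  # (~1520.5, ~1579.5)
```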

The experiment also highlighted some of the limitations of current LLMs. For example, in some cases the models would try to make moves that did not exist within the game, possibly due to hallucination. In another case, Claude 2.1 refused to play the game to avoid promoting violence. Claude 3, on the other hand, could detect that the request was made in the context of a game and complied with the prompts (I wonder if that could turn out to be an opening for jailbreaks).

[Chart: Elo ratings of LLMs playing Street Fighter III (powered by Amazon Bedrock)]

There are a few things that make real-time games an interesting field to study. First, they require the right balance between speed and accuracy. Second, they require good adaptation to contextual information. And third, they require the agent to retain enough memory to learn over the length of an episode. Various real-world applications have similar requirements.

“Most folks have interacted with LLMs in a chatbot-type session in a non-real-time environment,” Obayomi said. “This experiment demonstrates that we can start bringing LLMs to perform real-time tasks, such as perhaps dialogue navigation like having an LLM in an earpiece to help what to say and live broadcasting, having an LLM commentator on a sports game, or a video game stream.”

While Street Fighter III rounds are short, it will be interesting to see how LLM agents could be applied to more complicated games with longer durations. Obayomi will be testing other games in the future. He has released the code for his project on GitHub, and you can try it out for yourself.

A few ways come to mind to improve the project. For example, one modification would be to create a feedback loop that helps the LLM improve its gameplay by reflecting on its actions after playing the game. Once the game is over, the model (or a stronger model) can be given the full game history and told to analyze the moves and determine which kinds of actions were more successful. Those reflections can then be turned into tactical instructions and added to the system prompt used in the next round. This way, the model can learn to improve itself after each round. This is an example of using in-context learning to change the LLM’s behavior without fine-tuning it. DeepMind’s OPRO provides an interesting framework for such self-optimizations.
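A rough sketch of what that reflection step could look like is shown below. The prompt text and the `call_llm` helper are placeholders for whatever model invocation the project already uses; this is an illustration of the pattern, not a tested implementation.

```python
def reflect_on_match(call_llm, game_history: list[dict], base_system_prompt: str) -> str:
    """Ask the model (or a stronger one) to mine tactics from a finished match,
    then fold them into the system prompt for the next round.

    `call_llm` is assumed to be a function that takes a prompt string and returns text.
    """
    reflection_prompt = (
        "Here is the full move-by-move history of a Street Fighter III match you just played:\n"
        f"{game_history}\n"
        "Which of your actions worked and which did not? "
        "Summarize the lessons as 3-5 short tactical instructions."
    )
    tactics = call_llm(reflection_prompt)
    # The next round reuses the same base prompt, augmented with the learned tactics.
    return base_system_prompt + "\n\nTactics learned from previous rounds:\n" + tactics
```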

Another possible improvement would be to have a strong model play the game in a controlled setting where speed is not a limitation. This would allow the model to use its full accuracy without being constrained by time. Its output could then be used to fine-tune a small, fast model for that specific task, an example of distillation. It will be interesting to see what new lessons can be drawn from this experiment.
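As a sketch of how that distillation data could be collected, each slow, deliberate decision from the strong model becomes a prompt/completion pair for fine-tuning the small model. The JSONL format below is illustrative, and the state string and chosen move are hypothetical examples.

```python
import json

def record_distillation_pair(out_file, state_prompt: str, strong_model_move: str) -> None:
    """Append one (game state -> chosen move) example in a simple JSONL fine-tuning format."""
    example = {"prompt": state_prompt, "completion": strong_model_move}
    out_file.write(json.dumps(example) + "\n")

# Usage sketch: run the strong model with no time pressure, log every decision,
# then fine-tune a small, fast model on the resulting dataset.
with open("distillation_data.jsonl", "a") as f:
    record_distillation_pair(f, "Current state: low health, opponent jumping in", "High Punch")
```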
