LLMs battle it out in Street Fighter—here’s what it means for real applications

[Image: robots fighting, generated with Bing Image Creator]

Developers and researchers are finding ways to extend the capabilities of large language models (LLMs) beyond text-based applications. One interesting niche that is getting attention is video games, where an LLM is given instructions for playing the game and left to figure out how to achieve goals and maximize its score.

Banjo Obayomi, Senior Developer Advocate at Amazon Web Services (AWS), recently created a project in which he pitted various frontier LLMs against each other in Street Fighter III.

While seeing LLMs play games is entertaining, the findings can have important implications for real-world applications that require real-time responses and dynamic decision-making.

“As a builder and gamer, I’ve always been fascinated with AI agents playing video games,” Obayomi told TechTalks.

The classic technique for creating game-playing AI systems is to use reinforcement learning (RL). In such settings, an RL agent must play millions of rounds to learn policies that can maximize a reward in the game environment. RL-based systems have made great breakthroughs in the past decade. However, training reinforcement learning agents is very complicated and requires a lot of time and expensive compute resources.

“With LLMs the model has already gone through a pre-training phase and can be used right away just by using a prompt,” Obayomi said.

Obayomi was inspired by LLM Colosseum, an open source project that pits OpenAI and Mistral models against each other in Street Fighter III. The project uses Diambra, an emulator for creating AI agents that play different games.

Each model is given a description of the game, its current state, its own previous moves, and its opponent’s moves. It must choose its next move based on this information. The actions are then sent to the emulator, and the results are sent back to the models to choose their next moves. What’s interesting about this approach is that the model is not trained on any previous game data and uses pure in-context learning to choose its actions.

Obayomi used Amazon’s Bedrock platform, which provides serverless access to a wide range of LLMs, including models from Anthropic, AI21 Labs, Cohere, and Mistral (but not OpenAI). 
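To make the decision loop concrete, here is a minimal sketch of what a single turn might look like against a Claude model on Bedrock. It is an illustration under assumptions, not the project’s actual code: the move list, prompt wording, model ID, and fallback behavior are all placeholders.

```python
import json
import boto3

# Hypothetical move vocabulary -- the real project maps to Diambra's action space.
MOVES = ["Move Closer", "Move Away", "Low Punch", "High Punch", "Low Kick", "Hadoken"]

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def choose_move(game_state: dict, history: list[str], opponent_moves: list[str]) -> str:
    """Ask the LLM for its next move using only in-context information."""
    prompt = (
        "You are playing Street Fighter III. Pick your next move.\n"
        f"Valid moves: {', '.join(MOVES)}\n"
        f"Current state: {json.dumps(game_state)}\n"
        f"Your previous moves: {history[-5:]}\n"
        f"Opponent's recent moves: {opponent_moves[-5:]}\n"
        "Reply with exactly one move from the list."
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 20,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        body=json.dumps(body),
    )
    reply = json.loads(response["body"].read())["content"][0]["text"].strip()
    # Fall back to a safe default if the model hallucinates a move that doesn't exist.
    return reply if reply in MOVES else "Move Closer"
```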

Obayomi ran 314 matches with 14 models. Interestingly, the highest Elo ranking belonged to Haiku, the smallest and fastest version of Anthropic’s Claude 3 family. “The smaller models outperformed larger models in the arena likely due to their lower latency which allowed for quicker reaction times and more moves per match,” Obayomi writes in his blog post.

The original LLM Colosseum, created by Quivr CEO Stan Girard, gave GPT-3.5 Turbo the highest Elo rating, possibly also due to its very fast inference speed.
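For readers unfamiliar with how such leaderboards are built, Elo ratings are updated after each match based on how surprising the result was: beating a higher-rated model earns more points than beating a lower-rated one. Below is a standard Elo update in Python; the K-factor of 32 and the 1500 starting rating are conventional defaults, not values reported for this experiment.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one match."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: an underdog at 1500 beats a favorite at 1600 and gains about 20 points.
print(update_elo(1500, 1600, a_won=True))  # (~1520.5, ~1579.5)
```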

The experiment also highlighted some of the limitations of current LLMs. For example, in some cases the models would try to make moves that did not exist within the game, possibly due to hallucination. In another case, Claude 2.1 refused to play the game to avoid promoting violence. Claude 3, on the other hand, could detect that the request was made in the context of a game and complied with the prompts (I wonder if that could turn out to be an opening for jailbreaks).

[Chart: Elo ratings of LLMs playing Street Fighter III (powered by Amazon Bedrock)]

There are a few things that make real-time games an interesting field to study. First, they require the right balance between speed and accuracy. Second, they require good adaptation to contextual information. And third, they require the agent to retain enough memory to learn over the length of an episode. Various real-world applications have similar requirements.

“Most folks have interacted with LLMs in a chatbot-type session in a non-real-time environment,” Obayomi said. “This experiment demonstrates that we can start bringing LLMs to perform real-time tasks, such as perhaps dialogue navigation like having an LLM in an earpiece to help what to say and live broadcasting, having an LLM commentator on a sports game, or a video game stream.”

While Street Fighter III rounds are short, it will be interesting to see how LLM agents could be applied to more complicated games with longer durations. Obayomi will be testing other games in the future. He has released the code for his project on GitHub, and you can try it out for yourself.

A few ways come to mind to improve the project. For example, one modification would be to create a feedback loop that helps the LLM improve its gameplay by reflecting on its actions after playing the game. Once the game is over, the model (or a stronger model) can be given the full game history and told to analyze the moves and determine which kinds of actions were more successful. Those reflections can then be turned into tactical instructions and added to the system prompt used in the next round. This way, the model can learn to improve itself after each round. This is an example of using in-context learning to change the LLM’s behavior without fine-tuning it. DeepMind’s OPRO provides an interesting framework for such self-optimizations.
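A rough sketch of what that reflection step could look like is shown below. The prompt text and the `call_llm` helper are placeholders for whatever model invocation the project already uses; this is an illustration of the pattern, not a tested implementation.

```python
def reflect_on_match(call_llm, game_history: list[dict], base_system_prompt: str) -> str:
    """Ask the model (or a stronger one) to mine tactics from a finished match,
    then fold them into the system prompt for the next round.

    `call_llm` is assumed to be a function that takes a prompt string and returns text.
    """
    reflection_prompt = (
        "Here is the full move-by-move history of a Street Fighter III match you just played:\n"
        f"{game_history}\n"
        "Which of your actions worked and which did not? "
        "Summarize the lessons as 3-5 short tactical instructions."
    )
    tactics = call_llm(reflection_prompt)
    # The next round reuses the same base prompt, augmented with the learned tactics.
    return base_system_prompt + "\n\nTactics learned from previous rounds:\n" + tactics
```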

Another possible improvement would be to have a strong model play the game in a controlled setting where speed is not a limitation. This would allow the model to use its full accuracy without being constrained by time. Its output could then be used to fine-tune a small, fast model for that specific task, an example of distillation. It will be interesting to see what new lessons can be drawn from this experiment.
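As a sketch of how that distillation data could be collected, each slow, deliberate decision from the strong model becomes a prompt/completion pair for fine-tuning the small model. The JSONL format below is illustrative, and the state string and chosen move are hypothetical examples.

```python
import json

def record_distillation_pair(out_file, state_prompt: str, strong_model_move: str) -> None:
    """Append one (game state -> chosen move) example in a simple JSONL fine-tuning format."""
    example = {"prompt": state_prompt, "completion": strong_model_move}
    out_file.write(json.dumps(example) + "\n")

# Usage sketch: run the strong model with no time pressure, log every decision,
# then fine-tune a small, fast model on the resulting dataset.
with open("distillation_data.jsonl", "a") as f:
    record_distillation_pair(f, "Current state: low health, opponent jumping in", "High Punch")
```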
