On paper, AI agents promise to become autonomous workers that can handle complex, multi-step business processes. But in practice, today’s agents are often fragile, failing in unpredictable ways that make them unreliable (for example, a customer support bot might forget an earlier refund request, or a planning assistant might lose track of your dietary preferences between sessions).
One of the key reasons behind this unreliability is the challenge of managing an AI agent’s memory. The solution isn’t simply a larger context window or a better RAG system, but a fundamental rethinking of how an AI agent remembers, learns, and adapts. New agent-oriented frameworks are pioneering these advanced memory architectures, moving beyond simple recall to enable dynamic, evolving, and context-aware agents.
But solving the memory bottleneck is only the first step. True autonomy will require evolving the entire ecosystem, from the core models themselves to the digital world they operate in.
The memory bottleneck of AI agents
Large language models (LLMs), the backbone of AI agents, continue to support larger context windows, enabling them to fit and retrieve information from full books and knowledge corpora. But the challenge of agentic memory goes far beyond the limits of a model’s context window.
Even context windows reaching millions of tokens are not a complete solution for two key reasons. First, as a human-AI relationship develops over weeks or months, the conversation history will inevitably grow beyond any fixed limit. Second, simply feeding an LLM with a longer context does not guarantee it will effectively use past information.
The attention mechanisms that weigh the importance of different inputs can degrade over long distances, meaning information buried deep in a conversation might be overlooked. Moreover, a chat history can contain conflicting and distracting information, making it difficult for the LLM and the agent to find the most useful bits for each action they take.
Agents also face a more chaotic reality than simple chatbots. They interact with dynamic environments where unpredictable events can happen, such as changes to user interfaces or APIs, or simple errors like a website being out of service. Any of these events can derail a process that relies on static memories.
For many current agents, this means starting over from scratch. Their memory systems are often based on rigid, hand-crafted prompt templates or on knowledge embedded in the model’s parameters, both of which are slow and expensive to update. This prevents them from adapting to new situations or learning from their mistakes. Rigid memory structures can severely restrict an agent’s ability to generalize across new environments and remain effective in long-term interactions.
The emerging architectures of agentic memory
To solve these challenges, researchers are designing sophisticated frameworks that treat memory as a core, dynamic component. These new approaches differ in the type of knowledge they manage and how they structure it.
One approach focuses on procedural memory, or the “how-to” knowledge an agent gains from experience. A framework called Memp, developed by researchers at Zhejiang University and Alibaba Group, gives agents a procedural memory that is continuously updated. It works in a loop of three key stages: building, retrieving, and updating.
The process begins by building a memory from an agent’s past experiences, or “trajectories,” storing them as either step-by-step actions or higher-level scripts. When the agent faces a new task, it retrieves the most relevant past experience to use as a guide. This gives it a baseline that is a huge upgrade from randomly trying out different actions.
Most importantly, Memp includes an update mechanism that allows the agent to reflect on failures to correct and revise the original memory. This focus on learning across different tasks prevents the agent from having to re-explore from scratch each time, allowing it to improve with practice and adapt to changes in the environment.
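To make the loop concrete, here is a minimal Python sketch of a Memp-style procedural memory. The names (Trajectory, ProceduralMemory) and the keyword-overlap retrieval are illustrative assumptions, not Memp’s actual implementation, which the paper describes at a higher level:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list[str]      # step-by-step actions or a distilled script
    succeeded: bool

@dataclass
class ProceduralMemory:
    """Illustrative store for 'how-to' knowledge, in the spirit of Memp."""
    entries: list[Trajectory] = field(default_factory=list)

    def build(self, trajectory: Trajectory) -> None:
        # Stage 1: store a past experience as a reusable procedure.
        self.entries.append(trajectory)

    def retrieve(self, task: str) -> Trajectory | None:
        # Stage 2: fetch the most relevant prior experience as a guide.
        # A real system would use embedding similarity; keyword overlap
        # stands in here to keep the sketch self-contained.
        def overlap(t: Trajectory) -> int:
            return len(set(task.lower().split()) & set(t.task.lower().split()))
        candidates = [t for t in self.entries if overlap(t) > 0]
        return max(candidates, key=overlap, default=None)

    def update(self, old: Trajectory, revised: Trajectory) -> None:
        # Stage 3: after reflecting on a failure, replace the flawed
        # procedure with a corrected one instead of re-exploring.
        self.entries[self.entries.index(old)] = revised
```

The key design point is the third method: because memories are revised in place after failures, the agent’s baseline keeps improving instead of resetting with every new task.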
One of the key findings of the Memp study is that memory is transferable. In one experiment, procedural memory generated by GPT-4o was given to a much smaller model, Qwen2.5. The smaller model saw a significant boost in performance, suggesting that enterprises can acquire knowledge with a powerful model and then deploy it on smaller, more cost-effective ones.
A second approach focuses on declarative memory, or remembering “what happened,” to maintain conversational coherence. The Mem0 framework is designed to dynamically capture and organize key information from ongoing conversations.
Its architecture uses a two-phase pipeline. In the extraction phase, it processes a new message exchange and extracts a set of important “candidate facts.” The update phase then leverages the LLM’s own reasoning to evaluate these facts against existing memories. It decides whether to add a new fact, update an existing one with complementary information, delete a memory if it’s contradicted, or do nothing.
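Here is a hedged sketch of what such a two-phase pipeline could look like. The call_llm function is a placeholder for whatever model client you use, and the operation names (ADD, UPDATE, DELETE, NOOP) paraphrase the behavior described above rather than Mem0’s internal API:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (OpenAI, local model, etc.)."""
    raise NotImplementedError

def extract_facts(exchange: str) -> list[str]:
    # Phase 1: pull candidate facts out of the new message exchange.
    prompt = f"List the important facts in this exchange as a JSON array of strings:\n{exchange}"
    return json.loads(call_llm(prompt))

def update_memory(fact: str, memories: list[str]) -> list[str]:
    # Phase 2: let the LLM compare the candidate fact against existing
    # memories and choose an operation.
    prompt = (
        f"Existing memories: {json.dumps(memories)}\n"
        f"Candidate fact: {json.dumps(fact)}\n"
        'Reply with JSON: {"op": "ADD|UPDATE|DELETE|NOOP", '
        '"index": <int or null>, "text": <string or null>}'
    )
    decision = json.loads(call_llm(prompt))
    if decision["op"] == "ADD":
        memories.append(fact)                           # new information
    elif decision["op"] == "UPDATE":
        memories[decision["index"]] = decision["text"]  # complementary info
    elif decision["op"] == "DELETE":
        memories.pop(decision["index"])                 # contradicted memory
    return memories                                     # NOOP: leave unchanged
```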
For more complex reasoning, an enhanced version called Mem0g organizes these facts into a knowledge graph. In this graph, entities like people and places are nodes, and their relationships are edges. This structure enables the agent to navigate complex relational paths and answer multi-hop questions, such as “Who approved that budget, and when?”
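As a rough illustration (with hypothetical facts, not Mem0g’s actual schema), a few subject-relation-object triples are enough to answer that multi-hop question by following edges:

```python
# (subject, relation, object) triples: entities are nodes, relations are edges.
triples = [
    ("Dana", "approved", "Q3 budget"),
    ("Q3 budget", "approved_on", "2024-06-12"),
]

def follow(subject: str, relation: str) -> str | None:
    return next((o for s, r, o in triples if s == subject and r == relation), None)

# Hop 1: who approved the budget?
approver = next((s for s, r, o in triples if r == "approved" and o == "Q3 budget"), None)
# Hop 2: when was it approved?
date = follow("Q3 budget", "approved_on")
print(approver, date)  # Dana 2024-06-12
```

A flat list of conversation snippets would force the model to stitch these facts together itself; the graph makes the second hop a simple edge lookup.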
A third approach empowers agents with a self-organizing memory that can discover connections on its own. The A-MEM framework enables agents to autonomously create and link “memory notes” without predefined rules. Each time an agent interacts with its environment, A-MEM generates a structured memory note that captures information and metadata such as timestamps and keywords.

It then uses a two-step process to link each new note to existing ones, as sketched below. First, efficient embedding-based retrieval identifies a shortlist of potential memories to connect with. Then, an LLM analyzes the full content of these candidates to make a nuanced decision about which ones are most suitable to link. This goes beyond simple similarity metrics, allowing the system to discover higher-order patterns as its knowledge base grows.
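A simplified sketch of that two-step linking process follows; embed and llm_links are stand-ins (assumptions of this sketch) for a real embedding model and an LLM judgment call:

```python
from dataclasses import dataclass, field

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model (e.g., a sentence encoder)."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0

@dataclass
class Note:
    content: str
    keywords: list[str]     # metadata captured at creation time
    links: list["Note"] = field(default_factory=list)

def llm_links(new: Note, candidate: Note) -> bool:
    """Stand-in for the LLM's full-content judgment; a real system would
    prompt a model with both notes and ask whether they should be linked."""
    return bool(set(new.keywords) & set(candidate.keywords))

def link_note(new: Note, notes: list[Note], k: int = 5) -> None:
    # Step 1: cheap embedding retrieval narrows the candidate set.
    new_vec = embed(new.content)
    candidates = sorted(
        notes, key=lambda n: cosine(new_vec, embed(n.content)), reverse=True
    )[:k]
    # Step 2: the (stubbed) LLM inspects each candidate's full content
    # and keeps only the links it judges genuinely related.
    new.links = [c for c in candidates if llm_links(new, c)]
```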