Salesforce tackles the ‘brittleness’ of web agents with new WALT framework


This article is part of our coverage of the latest in AI research.

Salesforce Research has introduced a new framework that enables web agents to better navigate websites by leveraging the native tools and features those sites already contain. The framework, called WALT (Web Agents that Learn Tools), examines a website’s functionality to extract tools that abstract away low-level execution. This means that instead of reasoning about how to click and type, agents simply think in terms of tools such as searching, filtering, or adding new items to lists. WALT outperforms other web agent frameworks on key industry benchmarks and paves the way for more robust and versatile web agents.

The challenge of web navigation

Web agents are designed to automate complex browser tasks. But current methods for building web agents are brittle: they rely on step-by-step user interface interactions and heavy LLM reasoning, which fall apart on dynamic site layouts and long-horizon tasks.

In contrast, when humans navigate the web, they abstract away implementation details and focus on what they want to accomplish, not how the interface mechanics work. For example, humans think about websites in terms of high-level operations like search, filter, and sort. They leverage this prior knowledge to recognize reusable patterns across websites, allowing them to quickly adapt their interactions to new layouts.

Previous attempts to solve this problem for web agents have focused on discovering “skills,” which are reusable action sequences that encapsulate common interaction patterns. However, existing skill discovery approaches suffer from two key limitations. First, they either mine skills only from successful trajectories or require agents to hypothesize useful automations, often yielding unintuitive, overly specific, or irrelevant skills. Second, both approaches implement skills as brittle UI action sequences, highly sensitive to dynamic elements and design changes in the website.

WALT: Web Agents that Learn Tools

WALT takes a different approach. Unlike prior skills or workflows, WALT’s tools correspond to functionality that site designers have already engineered as robust automations (e.g., search bars, filters, sorting mechanisms, commenting systems, and navigation controls). Each tool is exposed to the agent as a high-level deterministic function (e.g., search()), and the underlying implementation is discovered and validated through a reverse-engineering process using LLM agents. As the researchers state in their paper, “This reframing shifts the agent’s capability frontier: instead of learning brittle approximations of interaction patterns, WALT surfaces the functionality already embedded in websites as reliable, reusable tools.”
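Concretely, a discovered tool pairs a contract (name, description, input schema) with a validated low-level action script. The sketch below is a hypothetical illustration of that idea, not WALT's actual code: the `Tool` class, the example site URL, and the URL-based search script are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import quote_plus

@dataclass
class Tool:
    """A high-level function exposed to the agent by contract."""
    name: str
    description: str
    input_schema: dict        # parameter name -> expected Python type
    action_script: Callable   # validated low-level implementation

    def __call__(self, **kwargs):
        # Enforce the contract before touching the browser at all.
        for param, expected in self.input_schema.items():
            if param not in kwargs:
                raise ValueError(f"missing required parameter: {param}")
            if not isinstance(kwargs[param], expected):
                raise TypeError(f"{param} must be {expected.__name__}")
        return self.action_script(**kwargs)

# Hypothetical discovered tool: search implemented via URL manipulation
# rather than a fragile click-and-type sequence.
def _search_script(query: str) -> str:
    return f"https://example-shop.test/search?q={quote_plus(query)}"

search = Tool(
    name="search",
    description="Search the site's catalog",
    input_schema={"query": str},
    action_script=_search_script,
)
```

The planner then reasons about `search(query=...)` as a single atomic call, with the contract catching malformed inputs before any browser interaction happens.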

WALT tool discovery framework (source: arXiv)

To achieve this, WALT follows a “demonstrate-generate-validate” loop for each identified tool on a given website. First, a web agent comprehensively demonstrates the functionality, such as cycling through all filters and sort options for a search feature. The agent then compiles a list of candidates for reusable tools that have clear user intents.

Next, a tool generation agent maps execution traces to structured tools with validated input schemas. This generation process prioritizes deterministic actions but allows agentic steps for dynamic elements. Where applicable, the agent replaces UI sequences with more robust URL manipulation through API reverse-engineering (this makes the tool robust to changes in the site’s graphic design or layout). Finally, a browser agent verifies functionality against pre-vetted test inputs. Each tool is realized as an action script containing a finite sequence of navigation, extraction, UI interaction, and agentic steps that the browser agent executes automatically.
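The loop can be summarized in a short sketch. This is a hypothetical reconstruction of the control flow described above, with `demo_agent`, `generate_tool`, and `browser_check` standing in for the paper's LLM-driven demonstration, generation, and validation agents.

```python
def discover_tool(functionality, demo_agent, generate_tool, browser_check,
                  test_inputs, max_attempts=3):
    """Demonstrate-generate-validate loop for one site functionality.

    Returns a validated tool, or None if validation never passes.
    """
    # 1. Demonstrate: exhaustively exercise the feature to collect a trace.
    trace = demo_agent(functionality)
    for _ in range(max_attempts):
        # 2. Generate: map the execution trace to a structured tool.
        tool = generate_tool(trace)
        # 3. Validate: run the tool against pre-vetted test inputs.
        if all(browser_check(tool, case) for case in test_inputs):
            return tool
    # Tools that never pass validation are not exposed to the agent.
    return None
```

The `max_attempts` cap mirrors the fixed validation budget mentioned below: a tool that repeatedly fails its checks is simply never surfaced at runtime.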

The goal is to expose tools as atomic actions, meaning the agent calls them by contract and relies on their internal execution without having to reason about intermediate steps. “For WALT, we equip the tool-enriched agent with agentic fallback—a last resort that spawns a fresh agent to handle a failing action script on the fly,” says Ran Xu, Director of Applied AI Research at Salesforce and co-author of the paper.
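The fallback mechanism Xu describes can be sketched as a simple wrapper. This is an assumed shape, not the paper's implementation: `tool` and `spawn_agent` here are plain callables invented for illustration.

```python
def call_with_fallback(tool, spawn_agent, **inputs):
    """Run a tool's action script; on failure, hand off to a fresh agent."""
    try:
        return tool(**inputs)
    except Exception as exc:
        # Agentic fallback: a newly spawned agent resumes from the
        # failing step instead of the whole task being abandoned.
        return spawn_agent(inputs=inputs, error=str(exc))
```

The planner still sees a single atomic call either way; only the failure path pays the cost of fresh LLM reasoning.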

Crucially, only tools that pass validation within a fixed number of attempts are exposed to the agent at runtime. Tool discovery and optimization happen offline during website exploration, which makes the system efficient and reliable during live tasks. 

The overall abstraction that WALT provides transforms the agent’s computational burden: instead of reasoning about complex UI sequences like “how do I search for X, then filter by Y,” the agent simply calls high-level functions and focuses on higher-level planning.
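As a hypothetical illustration (the selectors and tool names below are invented, not taken from the paper), the same task looks very different at the two levels of abstraction:

```python
# Brittle low-level plan: every selector is a potential breakage point.
ui_plan = [
    ("click", "#search-box"),
    ("type", "#search-box", "mountain bike"),
    ("press", "Enter"),
    ("click", "select[name=sort]"),
    ("select", "price_asc"),
]

# WALT-style plan: two intent-level calls with validated contracts.
tool_plan = [
    ("search", {"query": "mountain bike"}),
    ("sort", {"by": "price", "order": "asc"}),
]
```

Fewer, intent-level steps leave the planner less to reason about at each turn, which is where the efficiency gains reported below come from.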

How WALT works at runtime (ignores low-level details and thinks in terms of abstract tools) (source: arXiv)

WALT in action

The researchers evaluated WALT on two established benchmarks: VisualWebArena and WebArena. VisualWebArena contains visually grounded, human-annotated tasks that evaluate multimodal understanding on three real websites (Classifieds, Shopping, and Reddit). WebArena includes more general tasks across different domains (GitLab, Map, Shopping, CMS, and Reddit).

For their base WALT agent, the researchers paired a VLM planner powered by GPT-5 with a browser action executor using standard web actions. They compared WALT against a representative set of state-of-the-art methods, including skill-based web agents like SkillWeaver and the Claude computer use agent.

The results show that WALT achieves state-of-the-art success rates of 52.9% on VisualWebArena and 50.1% on WebArena, significantly outperforming prior methods and agents. Beyond raw success rates, the use of tools improves efficiency: the discovered tools, along with multimodal DOM parsing and external verification, raised the agent's success rates while reducing the average number of steps needed to complete tasks. (It is worth noting that while WALT beats other agents by a fair margin on most tasks, it still falls well short of average human performance, leaving substantial room for improvement.)

WALT performance on industry benchmarks (source: arXiv)

Despite these successes, the method has limitations. Offline tool discovery incurs an extra exploration and validation cost per website, and the quality of tools depends on what the site exposes. “For more practical use, we recommend constructing test cases that run periodically to identify agent failure due to obsolete tools… then conduct ‘patching’ only on identified websites,” says Xu.

Additionally, highly dynamic interfaces, A/B experiments, CAPTCHAs, and heavy anti-automation measures can hamper WALT’s ability to discover tools and use them reliably. “Some of the UI actions may not be able to be reverse-engineered into reusable tools,” Xu notes. “We think WALT is a good addition to browser agent… but encourage exploring all possible resources—MCP / API tools, documents as memory, etc.”

The researchers suggest their tool abstraction paradigm can create “a practical path for safe, auditable automation: tools carry explicit contracts, examples, and validation traces, making web agents easier to monitor, share, and maintain as sites evolve.” Looking ahead, Xu envisions a broader scope: “I think LLM and agents should be able to learn world knowledge of websites, human-website interactions, and exhibit some sort of zero-shot ability.”
