Salesforce tackles the ‘brittleness’ of web agents with new WALT framework


This article is part of our coverage of the latest in AI research.

Salesforce Research has introduced a new framework that enables web agents to better navigate websites by leveraging the native tools and features those sites already contain. The framework, called WALT (Web Agents that Learn Tools), examines a website’s functionality to extract tools that abstract away low-level execution. This means that instead of reasoning about how to click and type, agents simply think in terms of tools such as searching, filtering, or adding new items to lists. WALT outperforms other web agent frameworks on key industry benchmarks and paves the way for more robust and versatile web agents.

The challenge of web navigation

Web agents are designed to automate complex browser tasks. But current methods for building web agents are brittle: they rely on step-by-step user interface interactions and heavy LLM reasoning, which fall apart on dynamic site layouts and long-horizon tasks.

In contrast, when humans navigate the web, they abstract away implementation details and focus on what they want to accomplish, not how the interface mechanics work. For example, humans think about websites in terms of high-level operations like search, filter, and sort. They leverage this prior knowledge to recognize reusable patterns across websites, allowing them to quickly adapt their interactions to new layouts.

Previous attempts to solve this problem for web agents have focused on discovering “skills,” which are reusable action sequences that encapsulate common interaction patterns. However, existing skill discovery approaches suffer from two key limitations. First, they either mine skills only from successful trajectories or require agents to hypothesize useful automations, often yielding unintuitive, overly specific, or irrelevant skills. Second, both approaches implement skills as brittle UI action sequences, highly sensitive to dynamic elements and design changes in the website.

WALT: Web Agents that Learn Tools

WALT takes a different approach. Unlike prior skills or workflows, WALT’s tools correspond to functionality that site designers have already engineered as robust automations (e.g., search bars, filters, sorting mechanisms, commenting systems, and navigation controls). Each tool is exposed to the agent as a high-level deterministic function (e.g., search()), and the underlying implementation is discovered and validated through a reverse-engineering process using LLM agents. As the researchers state in their paper, “This reframing shifts the agent’s capability frontier: instead of learning brittle approximations of interaction patterns, WALT surfaces the functionality already embedded in websites as reliable, reusable tools.”
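Concretely, a discovered tool pairs a contract (name, description, input schema) with a validated low-level action script. The sketch below is a hypothetical illustration of that idea, not WALT's actual code: the `Tool` class, the example site URL, and the URL-based search script are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import quote_plus

@dataclass
class Tool:
    """A high-level function exposed to the agent by contract."""
    name: str
    description: str
    input_schema: dict        # parameter name -> expected Python type
    action_script: Callable   # validated low-level implementation

    def __call__(self, **kwargs):
        # Enforce the contract before touching the browser at all.
        for param, expected in self.input_schema.items():
            if param not in kwargs:
                raise ValueError(f"missing required parameter: {param}")
            if not isinstance(kwargs[param], expected):
                raise TypeError(f"{param} must be {expected.__name__}")
        return self.action_script(**kwargs)

# Hypothetical discovered tool: search implemented via URL manipulation
# rather than a fragile click-and-type sequence.
def _search_script(query: str) -> str:
    return f"https://example-shop.test/search?q={quote_plus(query)}"

search = Tool(
    name="search",
    description="Search the site's catalog",
    input_schema={"query": str},
    action_script=_search_script,
)
```

The planner then reasons about `search(query=...)` as a single atomic call, with the contract catching malformed inputs before any browser interaction happens.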

WALT tool discovery framework (source: arXiv)

To achieve this, WALT follows a “demonstrate-generate-validate” loop for each identified tool on a given website. First, a web agent comprehensively demonstrates the functionality, such as cycling through all filters and sort options for a search feature. The agent then compiles a list of candidates for reusable tools that have clear user intents.

Next, a tool generation agent maps execution traces to structured tools with validated input schemas. This generation process prioritizes deterministic actions but allows agentic steps for dynamic elements. Where applicable, the agent replaces UI sequences with more robust URL manipulation through API reverse-engineering (this makes the tool robust to changes in the site’s graphic design or layout). Finally, a browser agent verifies functionality against pre-vetted test inputs. Each tool is realized as an action script containing a finite sequence of navigation, extraction, UI interaction, and agentic steps that the browser agent executes automatically.
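The loop can be summarized in a short sketch. This is a hypothetical reconstruction of the control flow described above, with `demo_agent`, `generate_tool`, and `browser_check` standing in for the paper's LLM-driven demonstration, generation, and validation agents.

```python
def discover_tool(functionality, demo_agent, generate_tool, browser_check,
                  test_inputs, max_attempts=3):
    """Demonstrate-generate-validate loop for one site functionality.

    Returns a validated tool, or None if validation never passes.
    """
    # 1. Demonstrate: exhaustively exercise the feature to collect a trace.
    trace = demo_agent(functionality)
    for _ in range(max_attempts):
        # 2. Generate: map the execution trace to a structured tool.
        tool = generate_tool(trace)
        # 3. Validate: run the tool against pre-vetted test inputs.
        if all(browser_check(tool, case) for case in test_inputs):
            return tool
    # Tools that never pass validation are not exposed to the agent.
    return None
```

The `max_attempts` cap mirrors the fixed validation budget mentioned below: a tool that repeatedly fails its checks is simply never surfaced at runtime.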

The goal is to expose tools as atomic actions, meaning the agent calls them by contract and relies on their internal execution without having to reason about intermediate steps. “For WALT, we equip the tool-enriched agent with agentic fallback—a last resort that spawns a fresh agent to handle a failing action script on the fly,” says Ran Xu, Director of Applied AI Research at Salesforce and co-author of the paper.
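The fallback mechanism Xu describes can be sketched as a simple wrapper. This is an assumed shape, not the paper's implementation: `tool` and `spawn_agent` here are plain callables invented for illustration.

```python
def call_with_fallback(tool, spawn_agent, **inputs):
    """Run a tool's action script; on failure, hand off to a fresh agent."""
    try:
        return tool(**inputs)
    except Exception as exc:
        # Agentic fallback: a newly spawned agent resumes from the
        # failing step instead of the whole task being abandoned.
        return spawn_agent(inputs=inputs, error=str(exc))
```

The planner still sees a single atomic call either way; only the failure path pays the cost of fresh LLM reasoning.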

Crucially, only tools that pass validation within a fixed number of attempts are exposed to the agent at runtime. Tool discovery and optimization happen offline during website exploration, which makes the system efficient and reliable during live tasks. 

The overall abstraction that WALT provides transforms the agent’s computational burden: instead of reasoning about complex UI sequences like “how do I search for X, then filter by Y,” the agent simply calls high-level functions and focuses on higher-level planning.
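As a hypothetical illustration (the selectors and tool names below are invented, not taken from the paper), the same task looks very different at the two levels of abstraction:

```python
# Brittle low-level plan: every selector is a potential breakage point.
ui_plan = [
    ("click", "#search-box"),
    ("type", "#search-box", "mountain bike"),
    ("press", "Enter"),
    ("click", "select[name=sort]"),
    ("select", "price_asc"),
]

# WALT-style plan: two intent-level calls with validated contracts.
tool_plan = [
    ("search", {"query": "mountain bike"}),
    ("sort", {"by": "price", "order": "asc"}),
]
```

Fewer, intent-level steps leave the planner less to reason about at each turn, which is where the efficiency gains reported below come from.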

How WALT works at runtime (ignores low-level details and thinks in terms of abstract tools) (source: arXiv)

WALT in action

The researchers evaluated WALT on two established benchmarks: VisualWebArena and WebArena. VisualWebArena contains visually grounded, human-annotated tasks that evaluate multimodal understanding on three real websites (Classifieds, Shopping, and Reddit). WebArena includes more general tasks across different domains (GitLab, Map, Shopping, CMS, and Reddit).

For their base WALT agent, the researchers paired a VLM planner powered by GPT-5 with a browser action executor using standard web actions. They compared WALT against a representative set of state-of-the-art methods, including skill-based web agents like SkillWeaver and the Claude computer use agent.

The results show that WALT achieves state-of-the-art success rates of 52.9% on VisualWebArena and 50.1% on WebArena, significantly outperforming prior methods and agents. Beyond raw success rates, the use of tools improves efficiency: the discovered tools, along with multimodal DOM parsing and external verification, raised the agent's success rates while reducing the average number of steps needed to complete tasks. (It is worth noting that while WALT beats other agents by a fair margin on most tasks, it still falls well short of average human performance, leaving substantial room for improvement.)

WALT performance on industry benchmarks (source: arXiv)

Despite these successes, the method has limitations. Offline tool discovery incurs an extra exploration and validation cost per website, and the quality of tools depends on what the site exposes. “For more practical use, we recommend constructing test cases that run periodically to identify agent failure due to obsolete tools… then conduct ‘patching’ only on identified websites,” says Xu.

Additionally, highly dynamic interfaces, A/B experiments, CAPTCHAs, and heavy anti-automation measures can hamper WALT’s ability to discover tools and use them reliably. “Some of the UI actions may not be able to be reverse-engineered into reusable tools,” Xu notes. “We think WALT is a good addition to browser agent… but encourage exploring all possible resources—MCP / API tools, documents as memory, etc.”

The researchers suggest their tool abstraction paradigm can create “a practical path for safe, auditable automation: tools carry explicit contracts, examples, and validation traces, making web agents easier to monitor, share, and maintain as sites evolve.” Looking ahead, Xu envisions a broader scope: “I think LLM and agents should be able to learn world knowledge of websites, human-website interactions, and exhibit some sort of zero-shot ability.”
