What to know about Claude Opus 4.5


The AI industry moves fast, but the week of November 18, 2025, may go down as one of the most chaotic in its history. In a span of just seven days, the definition of “state-of-the-art” shifted three times. First, Google’s Gemini 3 Pro claimed the top spot on November 18. The very next day, OpenAI released GPT-5.1-Codex-Max, taking the crown. Then, on November 24, Anthropic released Claude Opus 4.5, its new flagship large language model (LLM), positioning it as the new “best model in the world” specifically for coding, agents, and computer use.

Under the hood: Claude Opus 4.5 architecture

Unfortunately, Claude models are the least transparent of the frontier models when it comes to architecture. But from the system card (Anthropic pointedly avoids the term “model card,” a hint that you’re not directly interacting with the LLM when you send requests to Claude, but with a larger system built around it), we know that Claude Opus 4.5 is a “hybrid reasoning” model, meaning a single model trained for both direct inference and chain-of-thought (CoT) reasoning.

Similar to the architecture used since Sonnet 3.7, Opus 4.5 distinguishes between a default mode for rapid responses and an “extended thinking” mode. This allows the model to deliberate on complex problems before generating an output, effectively showing its work internally before presenting a solution.
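
For API users, extended thinking is opt-in per request. Here is a minimal sketch using Anthropic’s Python SDK; the model alias and token budget are illustrative placeholders, so check the current API reference before copying:

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended thinking is enabled per request. budget_tokens caps how many
# tokens the model may spend deliberating before it commits to an answer;
# the values here are arbitrary examples.
response = client.messages.create(
    model="claude-opus-4-5",  # illustrative alias; pin the exact model ID you use
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Find the bug in this binary search: ..."}],
)

# The response interleaves "thinking" blocks (the internal deliberation)
# with "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200])
    elif block.type == "text":
        print(block.text)
```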

The most significant architectural introduction is the new “effort” parameter, which hands control of the cost-versus-intelligence trade-off directly to the user. Developers can toggle between Low, Medium, and High effort settings (see the sketch below). At Medium, the model attempts to balance speed and smarts, while High lets the model maximize its reasoning capabilities regardless of token consumption. The effort budget governs all output tokens, including thinking tokens and function calls. The model was trained on data up to May 2025, with a 200,000-token context window and a 64,000-token output limit, mirroring the capacity of the previous Sonnet model. (The small context window is a bit disappointing, given that Gemini has supported 1 million tokens for more than a year now.)
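
As for setting effort in practice, here is a hedged sketch that passes the setting through the Python SDK’s escape hatch for undocumented request parameters; the exact name and placement of the effort field are assumptions to verify against the API reference:

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical sketch: the exact field name and placement of "effort" are
# assumptions; consult Anthropic's API reference for the real request shape.
# extra_body is the SDK's escape hatch for parameters it doesn't yet type.
response = client.messages.create(
    model="claude-opus-4-5",  # illustrative alias; pin the exact model ID you use
    max_tokens=8192,
    extra_body={"effort": "high"},  # assumed values: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Refactor this module to remove the cyclic import."}
    ],
)

print(response.content[0].text)
print(response.usage)  # compare output_tokens across effort levels
```

The interesting experiment is running the same prompt at all three levels and comparing response.usage.output_tokens, which is where the effort setting should show up most directly.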

Performance and benchmarks

Performance metrics suggest that this hybrid approach is paying dividends in technical domains. On the SWE-bench Verified benchmark, which tests real-world software engineering capabilities, Opus 4.5 achieved a score of 80.9% when set to High effort. This figure surpasses both the recent GPT-5.1 and Gemini 3 Pro releases. (Anthropic’s chart is a bit misleading, though: the y-axis doesn’t start at zero, which visually exaggerates Claude’s lead.)

Perhaps more telling for the future of automated labor, the model outperformed every human candidate on Anthropic’s own internal technical take-home exam, a test specifically designed to assess judgment and technical ability under time pressure. (However, as I have pointed out time and again, judging LLMs based on tests designed for humans can be misleading.)

However, raw coding scores only tell part of the story. The model’s ability to handle ambiguity and navigate complex systems suggests a leap in “agentic” behavior (the ability of an AI to act autonomously to solve a problem). In one instance involving the τ2-bench benchmark, the model was tasked with acting as an airline service agent for a customer who wanted to change a basic economy ticket, a request the airline’s policy strictly forbids. Instead of simply refusing the request, Opus 4.5 found a legitimate loophole: it upgraded the cabin class first, which unlocked the ability to modify the flight. While the benchmark technically scored this as a failure because it expected a refusal, the result demonstrates the kind of creative problem-solving that enterprises look for in autonomous agents.

Despite these benchmark victories, determining the practical difference between these frontier models is becoming increasingly difficult for individual developers. Simon Willison, a prominent developer and tech blogger, noted that while Opus 4.5 successfully handled large-scale refactoring in his projects (managing 20 commits and changing nearly 40 files), he experienced little drop-off in productivity when he reverted to the older Sonnet 4.5 model.

This highlights a growing “evaluation crisis” where benchmarks show single-digit percentage improvements that may not immediately translate into noticeable workflow changes for daily tasks. (I’ve had similar experiences in comparing Gemini 3.0 to 2.5 on normal everyday tasks.)

One area where tangible progress is visible is efficiency. As models become more intelligent, they can theoretically solve problems with less backtracking and redundant reasoning. Anthropic claims that at a Medium effort level, Opus 4.5 matches the best score of Sonnet 4.5 on the SWE-bench Verified test but does so using 76% fewer output tokens. This efficiency is critical for cost-conscious engineering teams running automated agents that operate continuously.
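
To make that 76% figure concrete, here is a back-of-the-envelope calculation; the baseline workload below is an invented illustration, not a published number:

```python
# What "76% fewer output tokens" means for an agent's bill, using Opus 4.5's
# list price of $25 per million output tokens. The baseline token count is
# an invented example, not a published figure.
OUTPUT_PRICE_PER_MTOK = 25.00

baseline_output_tokens = 1_000_000                        # hypothetical agent run
opus_medium_tokens = baseline_output_tokens * (1 - 0.76)  # 76% reduction

print(f"Medium-effort output tokens: {opus_medium_tokens:,.0f}")  # 240,000
print(f"Output cost: ${opus_medium_tokens / 1e6 * OUTPUT_PRICE_PER_MTOK:.2f} "
      f"vs ${baseline_output_tokens / 1e6 * OUTPUT_PRICE_PER_MTOK:.2f} at the same rate")
# -> $6.00 vs $25.00 of output spend for the same benchmark score
```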

Safety remains a central concern, particularly regarding prompt injection. While Opus 4.5 is statistically harder to trick than its competitors, it is not immune. Anthropic’s data shows that a single prompt-injection attempt succeeds roughly 4.7% of the time. If an attacker is persistent and tries ten different attacks, however, the success rate jumps to 33.6%. (That is still an improvement over competitors like Gemini 3 Pro, where ten attempts yielded a 60.7% attack success rate.)
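
A quick sanity check on those numbers: if each attempt succeeded independently, ten tries at 4.7% would compound to roughly 38%, so the reported 33.6% is plausible and suggests repeated attempts against the same model aren’t fully independent:

```python
# Naive compounding bound: if each injection attempt succeeds independently
# with probability p, the chance that at least one of k attempts lands is
# 1 - (1 - p)**k.
p_single = 0.047  # reported single-attempt success rate
k = 10

p_compound = 1 - (1 - p_single) ** k
print(f"Independent-attempts bound: {p_compound:.1%}")  # ~38.2%
# The measured 33.6% at ten attempts sits just below this bound.
```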

Availability and pricing

Access to Opus 4.5 is immediate across Anthropic’s API and major cloud providers, including AWS Bedrock and Google Vertex AI. Alongside the model, Anthropic has released updates to the Claude Developer Platform and Claude Code. These include a new “Plan Mode” for Claude Code that builds user-editable plans before execution, and a “Zoom” tool that allows the model to inspect specific regions of a screen (useful for computer use tasks).

Opus 4.5’s efficiency gain is paired with a significant price reduction: $5 per million input tokens and $25 per million output tokens, a 3x cut from the previous Opus 4.1 model. That makes it a viable workhorse rather than a luxury tool reserved for only the most difficult prompts. (It is worth noting, however, that Claude Opus 4.5 is still considerably more expensive than rivals such as Gemini 3.0 Pro and GPT-5.1.)
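
The arithmetic behind the 3x claim, applied to a hypothetical monthly agent workload (the token volumes are invented for illustration):

```python
# Cost comparison for an invented workload: 10M input + 2M output tokens.
# Opus 4.1 listed at $15/$75 per million tokens; Opus 4.5 is $5/$25.
PRICES = {  # (input, output) in USD per million tokens
    "claude-opus-4-1": (15.00, 75.00),
    "claude-opus-4-5": (5.00, 25.00),
}

input_mtok, output_mtok = 10, 2  # hypothetical monthly volume

for model, (p_in, p_out) in PRICES.items():
    cost = input_mtok * p_in + output_mtok * p_out
    print(f"{model}: ${cost:,.2f}")
# claude-opus-4-1: $300.00
# claude-opus-4-5: $100.00  -> exactly one third of the old bill
```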
