Home Blog Page 3

Why AI benchmarks are broken

LLM benchmark race

Subscribe to continue reading

Become a paid subscriber to get access to the rest of this post and other exclusive content.

Salesforce tackles the ‘brittleness’ of web agents with new WALT framework

computer use agent
computer use agent

This article is part of our coverage of the latest in AI research.

Salesforce Research has introduced a new framework that enables web agents to better navigate websites by leveraging the native tools and features those sites already contain. The framework, called WALT (Web Agents that Learn Tools), examines a website’s functionality to extract tools that abstract away low-level execution. This means that instead of reasoning about how to click and type, agents simply think in terms of tools such as searching, filtering, or adding new items to lists. WALT outperforms other web agent frameworks on key industry benchmarks and paves the way for more robust and versatile web agents.

Beyond raw intelligence: How Poetiq cracked the ARC-AGI-2 benchmark

AI puzzle solving
AI puzzle solving

AI lab Poetiq has officially topped the ARC-AGI-2 leaderboard with an approach that hints at a significant shift in how AI systems solve complex reasoning tasks. On November 20, 2025, the company announced preliminary results that have now been verified by the ARC Prize team. The Poetiq system achieved a score of 54% on the Semi-Private Test Set, significantly outperforming the previous state-of-the-art held by Gemini 3 Deep Think, which scored 45%.

Beyond the accuracy gains, Poetiq’s system reached this milestone at a cost of $30.57 per problem, compared to the $77.16 per problem cost of Gemini 3 Deep Think. This result suggests that progress in AI reasoning is moving away from purely scaling model size and reasoning tokens and toward the implementation of well-engineered systems that optimize performance at the application layer.

SOUNDPEATS Clip1 review: Open-ear audio with all-day comfort

SOUNDPEATS Pearl Clip1
SOUNDPEATS Pearl Clip1
SOUNDPEATS Pearl Clip1
SOUNDPEATS Pearl Clip1

The SOUNDPEATS Clip1 are open-ear, clip-on earbuds designed for users who prioritize comfort and situational awareness. Positioned as an alternative to traditional in-ear and bone conduction models, they are intended for long listening sessions. As someone who has used other SOUNDPEATS open-ear devices, I found them to be useful for various settings, including office use, and outdoor activities like running or cycling. The earbuds retail for $69.99, placing them in the affordable segment of the open-ear audio market.

What makes DeepSeek-V3.2 so efficient?

DeepSeek-V3.2
DeepSeek-V3.2

Just when there was growing concern that DeepSeek was a flash in the pan, the Chinese AI lab released the production ready DeepSeek-V3.2, one of two best open source and a top-five overall large language model. 

DeepSeek-V3.2 performs impressively well on a wide range of benchmarks, per its own reporting as well as the independent tests. As of the time of this writing, DeepSeek-V3.2 stands in fifth place on the Artificial Analysis index, behind Kimi K2 Thinking and ahead of Grok 4.

OpenAI’s code red: The curse of being at the forefront of AI

OpenAI code red
OpenAI code red

OpenAI is scrambling to recover from Google’s huge AI comeback after the latter released Gemini 3.0 Pro and Nano Banana Pro. OpenAI Sam Altman has declared “Code Red,” according to the Information, and has warned: “We are at a critical time for ChatGPT.” The company is reportedly cancelling plans for ads and other products and is focusing on releasing its next model that will outperform Gemini 3.

OpenAI is not a profitable company (even with around $20 billion in annual recurring revenue). It needs to raise capital from investors to fund its next generation of models and products. It has managed to raise tens of billions of dollars on the premise and promise that it is and will remain the undisputed leader in AI. With the sentiment being that it is no longer in the lead, it is less likely to get the next funding round, unless it comes up with a convincing plan to take back the lead.

What is next in reinforcement learning for LLMs?

LLM reinforcement learning

Subscribe to continue reading

Become a paid subscriber to get access to the rest of this post and other exclusive content.

Prompt injection attack tricks Google’s Antigravity into stealing your secrets

vulnerable IDE
vulnerable IDE

A newly discovered vulnerability in Google’s Antigravity platform demonstrates how its autonomous AI agents can be manipulated into exfiltrating sensitive data from a developer’s environment. Security researchers at PromptArmor found that an indirect prompt injection, hidden within a seemingly harmless online document, can coerce Antigravity’s AI into bypassing its own security settings to steal credentials and proprietary code. The attack leverages the very agentic capabilities that Google promotes as the platform’s core strength.

What to know about Claude Opus 4.5

Anthropic Claude
Anthropic Claude

The AI industry moves fast, but the week of November 18, 2025, may go down as one of the most chaotic in its history. In a span of just seven days, the definition of “state-of-the-art” shifted three times. First, Google’s Gemini 3 Pro claimed the top spot on November 18. The very next day, OpenAI released GPT-5.1-Codex-Max, taking the crown. Then, on November 24, Anthropic released Claude Opus 4.5, its new flagship large language model (LLM), positioning it as the new “best model in the world” specifically for coding, agents, and computer use.

What is next for Yann LeCun after his departure from Meta?

World models

Subscribe to continue reading

Become a paid subscriber to get access to the rest of this post and other exclusive content.