Why AI benchmarks are broken

By

Ben Dickson

-

December 15, 2025

Salesforce tackles the ‘brittleness’ of web agents with new WALT framework

By

Ben Dickson

-

December 11, 2025

This article is part of our coverage of the latest in AI research.

Salesforce Research has introduced a new framework that enables web agents to better navigate websites by leveraging the native tools and features those sites already contain. The framework, called WALT (Web Agents that Learn Tools), examines a website’s functionality to extract tools that abstract away low-level execution. This means that instead of reasoning about how to click and type, agents simply think in terms of tools such as searching, filtering, or adding new items to lists. WALT outperforms other web agent frameworks on key industry benchmarks and paves the way for more robust and versatile web agents.

Continue

Beyond raw intelligence: How Poetiq cracked the ARC-AGI-2 benchmark

By

Ben Dickson

-

December 9, 2025

AI lab Poetiq has officially topped the ARC-AGI-2 leaderboard with an approach that hints at a significant shift in how AI systems solve complex reasoning tasks. On November 20, 2025, the company announced preliminary results that have now been verified by the ARC Prize team. The Poetiq system achieved a score of 54% on the Semi-Private Test Set, significantly outperforming the previous state-of-the-art held by Gemini 3 Deep Think, which scored 45%.

We have verified a new SOTA Gemini 3 Pro Refinement technique, authored by Poetiq

54% on ARC-AGI-2, $31/task https://t.co/M5Uo5ZDmXv pic.twitter.com/88y0lIvTCb
— ARC Prize (@arcprize) December 7, 2025

Beyond the accuracy gains, Poetiq’s system reached this milestone at a cost of $30.57 per problem, compared to the $77.16 per problem cost of Gemini 3 Deep Think. This result suggests that progress in AI reasoning is moving away from purely scaling model size and reasoning tokens and toward the implementation of well-engineered systems that optimize performance at the application layer.

Continue

SOUNDPEATS Clip1 review: Open-ear audio with all-day comfort

By

Ben Dickson

-

December 7, 2025

The SOUNDPEATS Clip1 are open-ear, clip-on earbuds designed for users who prioritize comfort and situational awareness. Positioned as an alternative to traditional in-ear and bone conduction models, they are intended for long listening sessions. As someone who has used other SOUNDPEATS open-ear devices, I found them to be useful for various settings, including office use, and outdoor activities like running or cycling. The earbuds retail for $69.99, placing them in the affordable segment of the open-ear audio market.

Continue

What makes DeepSeek-V3.2 so efficient?

By

Ben Dickson

-

December 5, 2025

Just when there was growing concern that DeepSeek was a flash in the pan, the Chinese AI lab released the production ready DeepSeek-V3.2, one of two best open source and a top-five overall large language model.

Bro disappeared like he never existed. pic.twitter.com/Bn88feWU2c
— Nalin (@nalinrajput23) November 25, 2025

DeepSeek-V3.2 performs impressively well on a wide range of benchmarks, per its own reporting as well as the independent tests. As of the time of this writing, DeepSeek-V3.2 stands in fifth place on the Artificial Analysis index, behind Kimi K2 Thinking and ahead of Grok 4.

Continue

OpenAI’s code red: The curse of being at the forefront of AI

By

Ben Dickson

-

December 3, 2025

OpenAI is scrambling to recover from Google’s huge AI comeback after the latter released Gemini 3.0 Pro and Nano Banana Pro. OpenAI Sam Altman has declared “Code Red,” according to the Information, and has warned: “We are at a critical time for ChatGPT.” The company is reportedly cancelling plans for ads and other products and is focusing on releasing its next model that will outperform Gemini 3.

OpenAI is not a profitable company (even with around $20 billion in annual recurring revenue). It needs to raise capital from investors to fund its next generation of models and products. It has managed to raise tens of billions of dollars on the premise and promise that it is and will remain the undisputed leader in AI. With the sentiment being that it is no longer in the lead, it is less likely to get the next funding round, unless it comes up with a convincing plan to take back the lead.

Continue

What is next in reinforcement learning for LLMs?

By

Ben Dickson

-

December 1, 2025

Prompt injection attack tricks Google’s Antigravity into stealing your secrets

By

Ben Dickson

-

November 27, 2025

A newly discovered vulnerability in Google’s Antigravity platform demonstrates how its autonomous AI agents can be manipulated into exfiltrating sensitive data from a developer’s environment. Security researchers at PromptArmor found that an indirect prompt injection, hidden within a seemingly harmless online document, can coerce Antigravity’s AI into bypassing its own security settings to steal credentials and proprietary code. The attack leverages the very agentic capabilities that Google promotes as the platform’s core strength.

Continue

What to know about Claude Opus 4.5

By

Ben Dickson

-

November 25, 2025

The AI industry moves fast, but the week of November 18, 2025, may go down as one of the most chaotic in its history. In a span of just seven days, the definition of “state-of-the-art” shifted three times. First, Google’s Gemini 3 Pro claimed the top spot on November 18. The very next day, OpenAI released GPT-5.1-Codex-Max, taking the crown. Then, on November 24, Anthropic released Claude Opus 4.5, its new flagship large language model (LLM), positioning it as the new “best model in the world” specifically for coding, agents, and computer use.

Continue

What is next for Yann LeCun after his departure from Meta?

By

Ben Dickson

-

November 24, 2025

TechTalks

How GhostClaw malware targets the OpenClaw AI agent boom

Why Meta’s V-JEPA 2.1 model is a massive step forward for…

Multi-level AI prompt engineering: A new tool for scientific discovery

Why AI won’t kill SaaS

How C-JEPA is teaching AI the physics of the physical world

Applied ML: When ‘perfect’ becomes the enemy of ‘good’

AI can’t replace software engineers yet, but here is how to…

How to turbocharge your product and market research with DeepSearch

How looking differently at data can save your machine learning project

Building a solid data foundation for generative AI applications

The evolution of LLM tool-use from API calls to agentic applications

What makes DeepSeek-V3.2 so efficient?

What to know about Claude Opus 4.5

OpenAI’s GPT-5: A reality check for the AI hype train

OpenAI’s grand return to open source: unpacking the gpt-oss release

AI is writing your code, but who’s reviewing it?

Machine learning in space: Building intelligent systems for the harshest environments

Decoding the brain, inspiring AI: How Rahul Biswas is bridging neuroscience…

The cash flow conundrum: How technology is reshaping small business finance

What to know about the security of open-source machine learning models

Why AI benchmarks are broken

Salesforce tackles the ‘brittleness’ of web agents with new WALT framework

Beyond raw intelligence: How Poetiq cracked the ARC-AGI-2 benchmark

SOUNDPEATS Clip1 review: Open-ear audio with all-day comfort

What makes DeepSeek-V3.2 so efficient?

OpenAI’s code red: The curse of being at the forefront of AI

What is next in reinforcement learning for LLMs?

Prompt injection attack tricks Google’s Antigravity into stealing your secrets

What to know about Claude Opus 4.5

What is next for Yann LeCun after his departure from Meta?

Subscribe to continue reading

Subscribe to continue reading

Subscribe to continue reading