This article is part of our coverage of the latest in AI research.
Salesforce Research, in collaboration with multiple universities, has released BLIP3o-NEXT, a new open-source model that unifies text-to-image generation and image editing within a single, streamlined architecture. Despite its relatively small size of 3 billion parameters, BLIP3o-NEXT demonstrates strong performance, matching or outperforming other open models on several key benchmarks.
Its intuitive design and open availability make it a promising foundation for developers looking to build custom generative AI applications or fine-tune models for specific use cases. As Ran Xu, Director of Applied AI Research at Salesforce and co-author of the paper, notes, the small model size and open-source assets make it an ideal starting point for customization. “If a startup has a specific use case or goal (such as generating layout of a design, generating planogram of a shelf) to be optimized, BLIP3o-NEXT provide a good framework to do RL to optimize the goal,” Xu told TechTalks.
How BLIP3o-NEXT works
BLIP3o-NEXT uses a hybrid architecture that combines an autoregressive model (the architecture used in classic language models) and a diffusion model (the architecture used in most image generation models). This two-stage approach leverages the distinct strengths of each component.
The autoregressive model acts as the brain, interpreting the user’s request and outlining the image’s overall structure and content. The diffusion model then acts as the artist, rendering the fine-grained, photorealistic details. The autoregressive model is based on Qwen3, while the diffusion transformer uses SANA1.5.

The autoregressive model’s main task is to process the user’s input—which can be a text prompt or a combination of text and reference images—and convert it into a sequence of discrete image tokens. Think of these tokens as words in a visual vocabulary that describe the core elements of the image. This approach differs from earlier models that generated continuous embeddings, which are more complex numerical representations.
The diffusion model then takes over, but it doesn’t just use the final discrete tokens. Instead, it uses their “hidden states” (the richer, more granular numerical data that the autoregressive model considered before settling on each final token). These hidden states provide a more detailed set of instructions, which the diffusion model uses as a conditioning signal to generate the final, high-fidelity image from an initial field of random noise.
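The two-stage flow can be sketched in a few lines of toy Python. Everything here is illustrative: the function names, the vocabulary size, the hidden-state width, and the "denoising" arithmetic are stand-ins, not the actual BLIP3o-NEXT implementation. The point is the data flow: the autoregressive stage emits discrete tokens plus richer hidden states, and the diffusion stage conditions on those hidden states (not the tokens) to denoise random noise into an image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the real model's dimensions.
VOCAB_SIZE = 1024   # size of the discrete visual vocabulary
HIDDEN_DIM = 64     # width of the autoregressive hidden states
NUM_TOKENS = 16     # length of the generated image-token sequence
IMG_SHAPE = (8, 8)  # resolution of the toy "image"

def autoregressive_stage(prompt: str):
    """Map a prompt to discrete image tokens plus their hidden states.

    The discrete tokens outline the image's structure, but it is the
    richer pre-quantization hidden states that get passed downstream
    as the conditioning signal.
    """
    seed = abs(hash(prompt)) % (2**32)
    local = np.random.default_rng(seed)
    hidden_states = local.standard_normal((NUM_TOKENS, HIDDEN_DIM))
    # Each hidden state is "snapped" to a discrete token id.
    tokens = (hidden_states.sum(axis=1).astype(int)) % VOCAB_SIZE
    return tokens, hidden_states

def diffusion_stage(hidden_states, steps: int = 10):
    """Iteratively denoise random noise, conditioned on the hidden states."""
    # Project the conditioning signal down to the image resolution.
    target = hidden_states.mean(axis=1).reshape(4, 4)
    target = np.kron(target, np.ones((2, 2)))  # upsample 4x4 -> 8x8
    img = rng.standard_normal(IMG_SHAPE)       # start from pure noise
    for _ in range(steps):
        img = img + 0.5 * (target - img)       # step toward the conditioned target
    return img

tokens, hidden = autoregressive_stage("a red cube on a table")
image = diffusion_stage(hidden)
print(tokens.shape, hidden.shape, image.shape)
```

In a real system, each stage is a trained network (Qwen3-based and SANA1.5-based, respectively); the sketch only mirrors the interface between them.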
Training for quality and consistency
The model’s core capabilities were developed by training the autoregressive component on three primary tasks: text-to-image generation, image reconstruction, and image editing. After this initial training, the researchers used reinforcement learning (RL) to further refine the model’s performance on specific skills, such as improving the quality of rendered text within images. For developers wondering how complex this is to implement for their own needs, Xu said, “I think it’ll be very easy to do.”
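The flavor of that RL refinement step can be conveyed with a minimal, self-contained sketch. Here the "reward" is a toy distance to a target vector; in practice it could be something like an OCR-based score on rendered text. The policy, batch size, and learning rate are all made up for illustration — this is a generic REINFORCE-style loop, not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward: in a real pipeline this might score how legibly
# text is rendered in a generated image. Here it is a toy distance so
# the loop runs standalone.
target = np.array([0.2, 0.8, 0.5, 0.1])

def reward(sample):
    return -np.sum((sample - target) ** 2)

# Toy "policy": a mean vector we sample generations around.
policy_mean = np.zeros(4)

for step in range(200):
    samples = policy_mean + 0.1 * rng.standard_normal((32, 4))
    rewards = np.array([reward(s) for s in samples])
    # REINFORCE-style update: move toward samples weighted by their
    # normalized advantage (reward minus the batch mean).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    policy_mean += 0.5 * (advantages[:, None] * (samples - policy_mean)).mean(axis=0)

print(reward(policy_mean))  # approaches 0 as the policy improves
```

The appeal for developers is that only the reward function needs to change to target a new skill (planogram layout, design layout, text rendering), which is what makes the framework easy to adapt.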

For image editing, maintaining consistency with the reference image is a major challenge, as high-level semantic representations can miss fine-grained pixel details. According to Xu, this is a key area where open-source models face a fundamental bottleneck. “Closed source models are likely to be trained with huge amounts of data, which requires significant infrastructure investment,” he explains. To compete, “we invested in innovation on new architectures and efficient training recipe, and open source the findings to benefit the research community.”
To address this challenge within these constraints, the team incorporated low-level VAE latents as an additional conditioning signal for the diffusion model. A VAE (Variational Autoencoder) can compress an image into a compact representation (its latent state) that preserves detailed visual information.
These VAE features from the reference image are integrated in two ways: they are combined with the hidden states from the autoregressive model, and they are also injected into the initial noise that the diffusion process starts with. This dual approach ensures that the generated image remains highly consistent with the original while accurately reflecting the user’s edits. This, combined with the strong multimodal reasoning of its autoregressive backbone, significantly improves BLIP3o-NEXT’s ability to follow complex instructions.
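The dual injection described above can be sketched as follows. The encoder here is just 2x2 average pooling standing in for a trained VAE, and all shapes, the mixing weight `alpha`, and the token reshaping are invented for illustration — only the two injection points (concatenated into the conditioning, and mixed into the initial noise) mirror the described design.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64

def vae_encode(image):
    """Toy VAE encoder: 2x2 average pooling as a stand-in for a trained
    network. The latent keeps low-level pixel structure, which is what
    the edit-consistency mechanism relies on."""
    return image.reshape(8, 2, 8, 2).mean(axis=(1, 3))

reference = rng.standard_normal((16, 16))              # reference image to edit
ref_latent = vae_encode(reference)                      # low-level latent (8x8)
hidden_states = rng.standard_normal((16, HIDDEN_DIM))   # from the AR model

# Injection 1: combine the VAE latent with the autoregressive hidden
# states to form the diffusion model's conditioning signal.
latent_tokens = ref_latent.reshape(16, 4)               # 16 tokens of dim 4
conditioning = np.concatenate([hidden_states, latent_tokens], axis=1)

# Injection 2: mix the latent into the initial noise the diffusion
# process starts from, so generation begins "anchored" to the reference.
alpha = 0.7
init = alpha * ref_latent + (1 - alpha) * rng.standard_normal(ref_latent.shape)

print(conditioning.shape, init.shape)
```

Because the starting noise already carries the reference image's low-level structure, the denoising process preserves unedited regions instead of reinventing them from the semantic description alone.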

While the model shows impressive results, the researchers note there is still room for improvement, such as applying reinforcement learning more directly to image editing to further boost instruction following and consistency.
When asked whether the next breakthrough will come from better data or better reward models for RL, Xu sees an iterative path forward. “Data engine improvement provides a certain degree of certainty to improve the quality of the model, while RL is an active research field that may change the landscape significantly if there is a breakthrough,” he said.
In their paper, the researchers write, “Looking forward, we believe the findings presented here point toward promising directions for the next frontier of foundation models, where unified architectures, reinforcement learning, and scalable post-training jointly drive progress in controllable, instruction-aligned, and high-quality native image generation systems.”