Inception Labs introduced Mercury 2 on Thursday, calling it the world's fastest reasoning language model. Per the company's announcement, it generates about 1,000 tokens per second (tokens are the chunks of text an AI model reads and writes), against roughly 89 tokens per second for Anthropic's Claude Haiku 4.5 Reasoning and 71 for OpenAI's GPT-5 Mini.
Welcome to the diffusion era.
We bet on parallel generation years ago, when it was a contrarian idea. It's great to see the industry arrive.
Mercury 2 and Google's DiffusionGemma both get their speed by dropping the typewriter approach to writing. A standard chatbot writes one word, checks what it just wrote, then writes the next, looping until the answer is finished. Diffusion models instead fill a block of text with random placeholder tokens and strip away the noise across a handful of parallel passes, the same trick that turns static into a photo in image generators like Stable Diffusion, until the whole block locks into a finished response at once.
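To make the difference in step counts concrete, here is a toy Python sketch. It is an illustration only, not Inception's or Google's method: the function names are hypothetical, and a known `target` sentence stands in for what a trained model would predict. The point is just that the typewriter loop needs one step per token, while the block approach needs a fixed handful of passes regardless of length.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]


def autoregressive_decode(target):
    """Typewriter style: one token per sequential step, left to right."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)  # stand-in for one model call predicting the next token
        steps += 1
    return out, steps


def diffusion_decode(target, passes=4, seed=0):
    """Block style: start from random tokens, fix a share of positions per pass.

    Toy assumption: the known target replaces a learned denoiser, so this
    shows only the control flow and step count, not real model behavior.
    """
    rng = random.Random(seed)
    block = [rng.choice(VOCAB) for _ in target]  # random placeholder tokens
    pending = list(range(len(target)))
    rng.shuffle(pending)  # positions get resolved in no particular order
    steps = 0
    for p in range(passes):
        # Ceiling division: spread the remaining positions over remaining passes.
        k = -(-len(pending) // (passes - p))
        for i in pending[:k]:  # all k positions are updated in the same pass
            block[i] = target[i]
        pending = pending[k:]
        steps += 1
    return block, steps


if __name__ == "__main__":
    sentence = "the cat sat on a mat the cat sat on a mat".split()
    _, ar_steps = autoregressive_decode(sentence)
    _, diff_steps = diffusion_decode(sentence, passes=4)
    print(f"autoregressive steps: {ar_steps}, diffusion passes: {diff_steps}")
```

For the 12-token sentence in the usage example, the sequential loop takes 12 steps and the block version takes 4 passes. In a real system each pass is itself a full model call, so the actual speedup depends on hardware and model design.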
Where the two diverge is what survives that process. On AIME 2026, a benchmark built from real American Invitational Mathematics Examination problems and scored as the percentage solved correctly, Mercury 2 hit 90%. Google tested DiffusionGemma on the same set, where it scored 69.1%, against 88.3% for standard, non-diffusion Gemma 4.
On GPQA, a PhD-level science benchmark scored the same way, the gap between the two diffusion models narrows to under four points: Mercury 2 at 77% against DiffusionGemma's 73.2%. But Google's own developer guide recommends standard Gemma 4 for applications that demand maximum quality, conceding DiffusionGemma trails it across the board.
Inception was built on research from its founder Stefano Ermon, a Stanford professor who co-authored some of the score-based diffusion techniques that power today's image generators. The startup's $50 million funding round drew backing from Nvidia's venture arm and individual investors Andrew Ng and Andrej Karpathy.
For non-technical users, the biggest change is one most people don't notice until they feel it: the "flow." Traditional models make you wait between thoughts in a long session. Diffusion models like this one make the AI feel like it's keeping pace with you, with instant autocomplete, rapid iterations on code or plans, and sub-agents that can handle the boring high-volume work without dragging the whole system down.
That sub-agent layer is the interesting architectural shift. Complex AI systems aren't one giant smart model anymore. They're orchestras of specialized helpers: one for deep reasoning, several for quick summarization, routing, tool lookup, and output checking. Sequential models make those utility calls expensive and slow. Parallel diffusion models make them cheap and fast enough to use liberally.
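The orchestra pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `call_model` is a hypothetical stand-in that sleeps instead of calling any real API, and the helper names and latencies are invented for the example. It shows several quick utility calls fanned out concurrently before a single slower reasoning call.

```python
import asyncio


async def call_model(name, prompt, latency_s):
    """Hypothetical model call: sleeps to simulate latency, then returns text."""
    await asyncio.sleep(latency_s)
    return f"{name}: handled {prompt!r}"


async def orchestrate(task):
    # Fan out the cheap utility helpers at the same time.
    # The latencies here are made-up numbers, not measurements.
    utility = await asyncio.gather(
        call_model("router", task, 0.01),
        call_model("summarizer", task, 0.01),
        call_model("checker", task, 0.01),
    )
    # Then make one call to the slower deep-reasoning helper.
    answer = await call_model("reasoner", task, 0.05)
    return list(utility), answer


def run(task):
    """Synchronous entry point for scripts and tests."""
    return asyncio.run(orchestrate(task))


if __name__ == "__main__":
    helpers, final = run("refactor module")
    for line in helpers:
        print(line)
    print(final)
```

The faster each utility call returns, the more of them a system can afford per user request, which is the trade-off the paragraph above describes.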
Realistic caveats for regular users: these models are still best suited to the speed-sensitive, high-volume parts of a workflow rather than the absolute hardest frontier reasoning, where the biggest autoregressive (AR) models may still have an edge. Mercury 2 isn't open weights, so it's available only through an API or the cloud for now. And as with Google's version, the surrounding ecosystem (local runtimes, agent frameworks) is still catching up to make it seamless everywhere.
Use cases that stand out immediately: real-time programming and "vibe coding" where the model keeps up with your edits, multi-agent coding or support systems that make lots of fast sub-calls, voice interfaces that don't feel laggy, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware add up fast.
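For a rough sense of what those throughput figures mean in practice, the short Python sketch below converts the tokens-per-second rates quoted in the company's announcement into generation time. The 500-token response length is an assumption chosen for illustration, and the calculation ignores network latency and time to first token.

```python
# Rates as quoted in Inception's announcement (tokens per second).
RATES_TOK_PER_S = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady rate, ignoring all other latency."""
    return num_tokens / tokens_per_second


def compare(num_tokens=500):
    """Seconds per model for a response of num_tokens (an assumed length)."""
    return {
        name: round(generation_seconds(num_tokens, rate), 2)
        for name, rate in RATES_TOK_PER_S.items()
    }


if __name__ == "__main__":
    for name, seconds in compare(500).items():
        print(f"{name}: {seconds} s")
```

At the quoted rates, a 500-token answer works out to about 0.5 seconds for Mercury 2, versus roughly 5.6 and 7.0 seconds for the other two models.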