Google's DiffusionGemma AI Hits 1,000 Tokens Per Second

Google's DiffusionGemma AI Hits 1,000 Tokens Per Second—And It's Free

Jun 11, 2026


Google says so itself: this is a speed model, not a quality upgrade.

What this actually does

Every LLM you've used is a typewriter: it produces one token at a time, with each token conditioned on the ones before it. That's how autoregressive architectures work.
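To make the typewriter analogy concrete, here is a minimal sketch of an autoregressive decoding loop. The lookup table `NEXT_TOKEN` is a hypothetical stand-in for a real model, and the function name is invented for illustration. The only point is that each step has to wait for the previous one.

```python
# Toy illustration of autoregressive decoding.
# NEXT_TOKEN is a hypothetical stand-in for a real language model:
# it maps the most recent token to the next one.
NEXT_TOKEN = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "</s>",
}


def generate_autoregressive(max_steps=10):
    """Return the generated tokens and the number of sequential steps used."""
    tokens = ["<s>"]
    steps = 0
    while steps < max_steps and tokens[-1] != "</s>":
        # One sequential "model call" per token: step k cannot start
        # until step k-1 has produced its token.
        tokens.append(NEXT_TOKEN[tokens[-1]])
        steps += 1
    return tokens[1:], steps


generated, sequential_steps = generate_autoregressive()
print(generated, sequential_steps)
```

Four output tokens cost four sequential steps here. A real model conditions on the whole prefix rather than only the last token, but the one-step-per-token cost is the same.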

The side effect is bidirectional attention: every token can see every other token while it is being generated. That is impossible in autoregressive models, which cannot attend to future tokens that have not been generated yet. It makes the model unusually good at tasks where the end of the answer constrains the beginning, such as code infilling, structured output, and constraint-heavy problems. Google fine-tuned a version to solve Sudoku as a demo. The base model got roughly 0% of puzzles right.

The fine-tuned version hit 80%.
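The difference between causal and bidirectional attention can be sketched with two attention masks. This is a generic illustration, not DiffusionGemma's actual implementation, and the function names are invented for the example.

```python
# Attention masks as nested lists of booleans:
# mask[i][j] is True when position i is allowed to attend to position j.

def causal_mask(n):
    """Autoregressive: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional: every position sees every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that position i can attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 4
print(visible_positions(causal_mask(n), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3]
```

Under the causal mask, position 1 cannot see positions 2 and 3. That is why an early token cannot be constrained by something that appears later in the answer.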

But none of that was open-weight, and none of it came with day-zero support in vLLM, Hugging Face Transformers, and Unsloth. DiffusionGemma is the first major open release from a tier-one lab.

There's also a historical irony worth noting. Image generators started as diffusion models (hence the name Stable Diffusion) and are now moving toward autoregressive architectures for better quality. Language models started as autoregressive and are now experimenting with diffusion for speed.

Why it’s a pain to run… for now

The problem: DiffusionGemma needs a specific drafter to run locally via MLX—Apple's machine learning framework for Apple Silicon. That module doesn't exist in any public version of mlx-lm, in any open pull request, or in LM Studio's bundled runtime.

We tried running DiffusionGemma with Hermes through NVIDIA NIM. The model loaded, but then: "agent init failed: Model google/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent."

To be precise: DiffusionGemma's actual context window is 256K tokens. The 8,192 figure was a default on NVIDIA's side, not the model's architectural limit.

In practice, getting it configured correctly for agentic use requires manual work that most everyday users haven't figured out yet, and Hermes Agent simply won't initialize without it. Parallel speed means nothing if the agent can't boot.
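The failure above comes down to a simple threshold check at startup. The sketch below shows the shape of such a check. It is not Hermes Agent's actual code. The function name is hypothetical, the 64,000 minimum comes from the quoted error message, and 256K is treated as 256,000 for illustration.

```python
# Hypothetical startup guard, modelled on the quoted error message.
MIN_CONTEXT_TOKENS = 64_000  # minimum stated in the error message


def check_context_window(model_id, reported_context, minimum=MIN_CONTEXT_TOKENS):
    """Raise if the context window reported by the serving layer is too small."""
    if reported_context < minimum:
        raise ValueError(
            f"Model {model_id} has a context window of {reported_context:,} tokens, "
            f"which is below the minimum {minimum:,} required"
        )


MODEL_ID = "google/diffusiongemma-26b-a4b-it"

# The serving default reported in the article: fails the check.
try:
    check_context_window(MODEL_ID, 8_192)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# The model's stated 256K window (taken as 256,000 here): passes.
check_context_window(MODEL_ID, 256_000)
```

The check only sees the number the serving layer reports, so the model is rejected even though its own limit is far higher.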

Hopefully, in the next few days, the community will produce better resources to run these models.

Who this is actually for

For researchers, bidirectional generation opens territory that autoregressive models simply can't reach—protein sequences, mathematical graphs, anything where position N depends on position N+50. That's not a small thing.

On a machine with a capable discrete GPU, 1,000 tokens per second is real.
