
Google says so themselves. This is a speed model, not a quality upgrade.
What this actually does
Every LLM you've used is a typewriter. It produces one token at a time, with each token dependent on the ones before it. That's how autoregressive architectures work.

The side effect is bidirectional attention: every token can see every other token while being generated. That is impossible in autoregressive models, which cannot see future tokens that have not been generated yet. It makes the model unusually good at tasks where the end of the answer constrains the beginning: code infilling, structured output, and constraint-heavy problems. Google fine-tuned a version to solve Sudoku as a demo. The base model got roughly 0% of puzzles right; the fine-tuned version hit 80%.
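To make the difference concrete, here is a minimal sketch of the two attention patterns as boolean masks. It is a toy illustration in plain Python, not DiffusionGemma's actual implementation; the function names are made up for this example.

```python
def causal_mask(n):
    """Autoregressive pattern: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional pattern: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that position i is allowed to look at."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 5
    # Under a causal mask, position 1 sees only itself and what came before.
    print(visible_positions(causal_mask(n), 1))         # [0, 1]
    # Under a bidirectional mask, position 1 also sees positions 2..4.
    print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3, 4]
```

With the causal mask, an early position has no way to react to what comes later, which is why a constraint at the end of a Sudoku row or a code block cannot shape the beginning.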
But none of that was open-weight, and none of it came with day-zero support in vLLM, Hugging Face Transformers, and Unsloth. DiffusionGemma is the first major open release from a tier-one lab.
There's also a historical irony worth noting. Image generators started as diffusion models (hence the name Stable Diffusion) and are now moving toward autoregressive architectures for better quality. Language models started as autoregressive and are now experimenting with diffusion for speed.
Why it’s a pain to run… for now
The problem: DiffusionGemma needs a specific drafter to run locally via MLX, Apple's machine learning framework for Apple Silicon. That module doesn't exist in any public version of mlx-lm, in any open pull request, or in LM Studio's bundled runtime.
We tried running DiffusionGemma with Hermes through NVIDIA NIM. The model loaded, but then: "agent init failed: Model google/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent."
To be precise: DiffusionGemma's actual context window is 256K tokens. The 8,192 figure was a default applied on the NVIDIA NIM side, not the model's architectural limit.
In practice, getting it configured correctly for agentic use requires manual work that most everyday users haven't figured out yet, and Hermes Agent simply won't initialize without it. Parallel speed means nothing if the agent can't boot.
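The failure mode is easy to picture as a startup guard. The sketch below is hypothetical: it is not Hermes Agent's code, and the function and constant names are invented. Only the 64,000 minimum, the 8,192 reported value, and the model ID come from the error message quoted above; 256K is taken as 256,000 for illustration.

```python
MIN_CONTEXT_TOKENS = 64_000  # minimum quoted in the error message


def check_context_window(model_id, reported_context):
    """Refuse to start when the serving layer reports too small a context window.

    Hypothetical guard: it compares the value the server reports,
    not the model's architectural limit.
    """
    if reported_context < MIN_CONTEXT_TOKENS:
        raise ValueError(
            f"Model {model_id} has a context window of {reported_context:,} tokens, "
            f"which is below the minimum {MIN_CONTEXT_TOKENS:,} required."
        )
    return reported_context


if __name__ == "__main__":
    model = "google/diffusiongemma-26b-a4b-it"
    try:
        check_context_window(model, 8_192)  # the value reported by default
    except ValueError as err:
        print(f"agent init failed: {err}")
    # Passes once the serving layer is configured to expose the larger window.
    check_context_window(model, 256_000)
```

The point is that a guard like this only ever sees the configured value, so the fix lives in the serving configuration rather than in the model.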
Hopefully, in the next few days, the community will produce better resources to run these models.
Who this is actually for
For researchers, bidirectional generation opens territory that autoregressive models simply can't reach: protein sequences, mathematical graphs, anything where position N depends on position N+50. That's not a small thing.
On a machine with a capable discrete GPU, 1,000 tokens per second is real.


















