Running an AI model on your own computer is great—until it isn't.
The promise is privacy, no subscription fees, and no data leaving your machine. The reality, for most people, is watching a cursor blink for five seconds between sentences.
That bottleneck has a name: inference speed. And it has nothing to do with how smart the model is. It's a hardware problem. Standard AI models generate text one word fragment—called a token—at a time. The hardware has to shuttle billions of parameters from memory to its compute units just to produce each single token. It's slow by design. On consumer hardware, it's painful.

The approach is called speculative decoding, and it's been around as a concept for years. Google researchers published the foundational paper back in 2022. The idea didn't go mainstream until now because it required the right architecture to make it work at scale.
Here's the short version of how it works. Instead of making the big, powerful model do all the work alone, you pair it with a tiny "drafter" model. The drafter is fast and cheap—it predicts several tokens at once in less time than the main model would take to produce just one. Then the big model checks all of those guesses in a single pass. If the guesses are right, then you get the whole sequence for the price of one forward pass.
Nothing is sacrificed: The large model—Gemma 4's 31B dense version, for example—still verifies every token, and the output quality is identical. You're just exploiting idle compute power that was sitting unused during the slow parts.
Google says the drafter models share the target model's KV cache—a memory structure that stores already-processed context—so they don't waste time recalculating things the larger model already knows. For the smaller edge models designed for phones and Raspberry Pi devices, the team even built an efficient clustering technique to further cut generation time.
This isn't the only attempt the AI world has made at parallelizing text generation. Diffusion-based language models—like Mercury from Inception Labs—tried a completely different approach: Instead of predicting one token at a time, they start with noise and iteratively refine the entire output. That’s fast on paper, but diffusion LLMs have struggled to match the quality of traditional transformer models, leaving them more of a research curiosity than a practical tool.
Speculative decoding is different because it doesn't change the underlying model at all. It's a serving optimization, not an architecture replacement. The same Gemma 4 you'd already run gets faster.
The practical upside is real. A Gemma 4 26B model running on an Nvidia RTX Pro 6000 desktop GPU gets roughly twice the tokens per second with the MTP drafter enabled, according to Google's own benchmarks. On Apple Silicon, batch sizes of 4 to 8 requests unlock around 2.2x speedups. Not quite the 3x ceiling in every scenario, but still a meaningful difference between "barely usable" and "actually fast enough to work with."
Google says the drafter unlocks "improved responsiveness: drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows"—the kind of tasks that demand low latency to feel useful at all.
Use cases snap into focus quickly: A local coding assistant that doesn't lag; a voice interface that responds before you've forgotten what you asked; an agentic workflow that doesn't make you wait three seconds between steps. All of this, on hardware you already own.



















