That's not a small problem. Multi-agent systems generate far more tokens than a normal chat, because every tool call, reasoning step, and slice of context gets re-sent from scratch. As a result, costs explode, models tend to drift, and agents slowly lose track of what they were supposed to be doing in the first place, or at least become less accurate.
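To see why re-sending context gets expensive so quickly, here is a back-of-the-envelope sketch in Python. The step count and tokens-per-step figures are made-up illustrative numbers, not measurements from any real agent system, and the function names are hypothetical.

```python
def total_tokens_resent(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when every step re-sends the full history.

    Step k has to send all k chunks produced so far, so the total is
    tokens_per_step * (1 + 2 + ... + steps).
    """
    return tokens_per_step * steps * (steps + 1) // 2


def total_tokens_sent_once(steps: int, tokens_per_step: int) -> int:
    """Total input tokens if each chunk only had to be sent a single time."""
    return tokens_per_step * steps


if __name__ == "__main__":
    # Hypothetical workload: 50 tool calls, 2,000 new tokens of context each.
    steps, per_step = 50, 2_000
    resent = total_tokens_resent(steps, per_step)
    once = total_tokens_sent_once(steps, per_step)
    print(f"re-sent from scratch: {resent:,} tokens")
    print(f"sent once:            {once:,} tokens")
    print(f"overhead:             {resent / once:.1f}x")
```

The re-sent total grows with the square of the number of steps, which is why long agent runs cost disproportionately more than short ones.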
Nemotron 3 Super is Nvidia's answer to all of that. The model runs 12 billion active parameters out of 120 billion total, using a mixture-of-experts (MoE) design that keeps inference cheap while retaining the reasoning depth complex workflows need. It offers a 1-million-token context window, so agents can hold an entire codebase, or roughly 750,000 words, in memory before running out of room.
To build its model, Nvidia combined three components that rarely appear together in the same architecture: Mamba-2 state-space layers—a faster, memory-efficient alternative to attention for handling long token streams—along with Transformer attention layers for precise recall, and a new “Latent MoE” design that compresses token embeddings before routing them to experts. That allows the model to activate four times as many specialists at the same compute cost.
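The "four times as many specialists at the same compute cost" claim can be illustrated with simple arithmetic. The sketch below is not Nvidia's implementation: it assumes each expert is a two-layer MLP, that the latent width is one quarter of the model width, and that only the input/output width of the expert shrinks. All sizes and function names are hypothetical.

```python
def expert_macs(width: int, hidden: int) -> int:
    """Multiply-accumulates for one token through one two-layer expert MLP
    (width -> hidden -> width), ignoring biases and activations."""
    return 2 * width * hidden


def moe_expert_macs(width: int, hidden: int, active_experts: int) -> int:
    """Expert compute per token when `active_experts` experts are activated."""
    return active_experts * expert_macs(width, hidden)


if __name__ == "__main__":
    # All sizes below are hypothetical, chosen only to make the ratio visible.
    d_model, d_hidden = 4096, 8192
    compression = 4                      # assumed latent compression ratio
    d_latent = d_model // compression

    baseline = moe_expert_macs(d_model, d_hidden, active_experts=2)
    latent = moe_expert_macs(d_latent, d_hidden, active_experts=2 * compression)

    # One-off cost of projecting down to the latent width and back up again.
    projection = 2 * d_model * d_latent

    print(f"baseline, 2 experts at full width: {baseline:,} MACs")
    print(f"latent, 8 experts at 1/4 width:    {latent:,} MACs")
    print(f"extra projection cost per token:   {projection:,} MACs")
```

Under these assumptions the expert compute is identical in both cases, and the only extra cost is the shared down- and up-projection, which is paid once per token rather than once per expert.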
Introducing NVIDIA Nemotron 3 Super
Open 120B-parameter (12B active) hybrid Mamba-Transformer MoE model
Native 1M-token context
Built for compute-efficient, high-accuracy multi-agent applications
The model was also pretrained natively in NVFP4, Nvidia’s 4-bit floating-point format. In practice, that means the system learned to operate accurately within 4-bit arithmetic from the very first gradient update, rather than being trained at high precision and compressed afterward, which often causes models to lose accuracy.
For context, a model’s precision is measured in bits. Full precision, known as FP32, is the gold standard—but it is also extremely expensive to run at scale. Developers often reduce precision to save compute while trying to preserve useful performance.
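A rough calculation shows what is at stake. The snippet below estimates the memory needed just to store the weights of a 120-billion-parameter model at different precisions. It is a simplification: it ignores activations, the context cache, and the small per-block scale factors that low-precision formats store alongside the weights. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    total_params = 120e9  # total parameter count quoted in the article
    for name, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name:>6}: {weight_memory_gb(total_params, bits):6.0f} GB")
```

Going from 32 bits to 4 bits per weight cuts the storage requirement by a factor of eight, which is the saving developers are chasing when they reduce precision.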
Think of it like shrinking a 4K image down to 1080p: The picture still looks the same at a glance, just with less detail. Normally, dropping from 32-bit precision all the way to 4-bit would cripple a model’s reasoning ability. Nemotron avoids that problem by learning to operate at low precision from the start, instead of being squeezed into it later.
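To make the loss of detail concrete, here is a simplified sketch of rounding numbers to a 4-bit float grid. It uses the eight magnitudes a 4-bit float with one sign bit, two exponent bits, and one mantissa bit can represent, with one shared scale per block of 16 values. This only illustrates the rounding step, not Nvidia's training recipe: the scale is kept as an ordinary Python float for simplicity, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit float with 1 sign, 2 exponent and
# 1 mantissa bit (E2M1).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # number of values sharing one scale factor


def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (scale, codes), where each code is a signed FP4 grid value."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale: float, codes: list[float]) -> list[float]:
    """Turn FP4 codes back into ordinary floats."""
    return [scale * c for c in codes]


def fake_quantize(values: list[float]) -> list[float]:
    """Quantize and immediately dequantize, block by block."""
    out: list[float] = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    # Hypothetical weights, evenly spread between -0.24 and 0.21.
    weights = [0.03 * i - 0.24 for i in range(16)]
    restored = fake_quantize(weights)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"largest rounding error in the block: {worst:.4f}")
```

Each value can land on only 15 distinct levels per block, so some rounding error is unavoidable. A model that sees this rounding throughout training can adapt its weights to it, whereas a model compressed afterward cannot.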
We ran our own quick test. The reasoning held up well, including on prompts that were deliberately vague, badly worded, or based on wrong information. The model caught small errors in context without being asked to, handled math and logic problems cleanly, and didn't fall apart when the question itself was slightly off.

The full training pipeline is public: weights on Hugging Face, 10 trillion curated pretraining tokens (with 25 trillion tokens seen in total during training), 40 million post-training samples, and reinforcement learning recipes across 21 environment configurations. Perplexity, Palantir, Cadence, and Siemens are already integrating the model into their workflows.
The $26 billion bet
The investment is strategic considering Nvidia's chips are still the default infrastructure for training and running frontier models. Models tuned to its hardware give customers a built-in reason to stay on Nvidia despite efforts from competitors to use other hardware. But there's a more urgent pressure behind the move: America is losing the open-source AI race, and losing it fast.
While U.S. giants like OpenAI, Anthropic, and Google keep their best models locked behind APIs, Chinese labs from DeepSeek to Alibaba have been flooding the open ecosystem. Meta was the one major American player competing in open source with Llama, but Zuckerberg recently signaled the company might not make future models fully open.
The gap between "best proprietary model" and "best open model" used to be massive—and in America's favor. That gap is now very small, and the open side of the ledger is increasingly Chinese.
Incredible graph. In just one year, China completely overtook the U.S. in free AI models.
That's the scenario Nvidia most needs to prevent: Chinese open models and Chinese chips building an ecosystem that doesn't need Nvidia at all.


















