Nemotron 3 Ultra
Also known as: NVIDIA Nemotron Ultra, Nemotron Ultra 550B
Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model. It has 550 billion total parameters but activates only about 55 billion of them (roughly 10%) per token. That sparsity lets it serve over 300 tokens per second at inference time, a speed most models of this size cannot match. NVIDIA released the weights on June 4, 2026, together with training data and recipes, under a permissive Linux Foundation license that allows commercial use.
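To make the "about 10% active" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count (20) and top-2 routing are hypothetical illustration values, not Nemotron 3 Ultra's published configuration. In typical MoE designs the active count also includes shared layers that every token passes through, so the real 55B figure is not purely expert weights.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, k):
    """Pick the top-k experts for one token and renormalize their weights.

    Returns (expert_index, weight) pairs; every other expert is skipped,
    so its parameters are never touched for this token.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def active_fraction(num_experts, k):
    """Share of (equal-sized) expert parameters used per token under top-k routing."""
    return k / num_experts

# Headline figures from the entry: ~55B of 550B parameters active per token.
print(f"{55 / 550:.0%}")  # 10%

# Hypothetical toy layer: 20 equal-sized experts, top-2 routing (NOT the real config).
NUM_EXPERTS, TOP_K = 20, 2
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits, TOP_K)
print(len(chosen), f"{active_fraction(NUM_EXPERTS, TOP_K):.0%}")  # 2 10%
```

The per-token compute scales with the experts actually chosen, not the full parameter count, which is why a sparse model can decode faster than a dense model of the same total size.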
On the Artificial Analysis Intelligence Index it scores 48, the highest of any US open-weights model released to date and well ahead of the previous US leader. The honest caveat: China's open-weights frontier (Kimi K2.6 at 54, DeepSeek V4 Pro) still leads globally. Nemotron 3 Ultra narrows that gap but does not close it.
For builders, the key point is what the model was designed for. NVIDIA built it specifically for multi-step agent pipelines: planning, calling tools, delegating to subagents, reading observations, and recovering from errors across hundreds of turns. It is available via Hugging Face, OpenRouter, and NVIDIA NIM (NVIDIA's hosted inference service), and can be self-hosted on vLLM, SGLang, and TensorRT-LLM (TRT-LLM), but self-hosting requires datacenter GPU infrastructure.
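A rough weights-only estimate shows why self-hosting needs datacenter hardware: even though only ~55B parameters are active per token, all 550B must be resident in GPU memory. This sketch assumes the listed precisions (BF16, FP8, 4-bit) and a hypothetical 80 GB-class GPU; the entry does not state which precision the released checkpoints use, and the estimate ignores KV cache and activations.

```python
import math

def weight_memory_gb(params_billions, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return params_billions * bytes_per_param

def min_gpus(params_billions, bytes_per_param, gpu_mem_gb):
    """Lower bound on GPUs needed just to hold the weights."""
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / gpu_mem_gb)

TOTAL_PARAMS_B = 550  # all experts must be loaded, not just the ~55B active ones
GPU_MEM_GB = 80       # hypothetical per-GPU memory, for illustration only

for label, nbytes in [("BF16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    gb = weight_memory_gb(TOTAL_PARAMS_B, nbytes)
    print(label, gb, min_gpus(TOTAL_PARAMS_B, nbytes, GPU_MEM_GB))
# BF16 1100.0 14
# FP8 550.0 7
# 4-bit 275.0 4
```

These are floors, not deployment sizes: long agent sessions spanning hundreds of turns grow the KV cache, so real serving setups need headroom beyond the weights.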