Mixture of Experts
Also known as: MoE, sparse MoE, mixture-of-experts architecture
Traditional dense models use all their parameters on every token. A mixture-of-experts model instead routes each token to a small subset of specialized sub-networks, called experts. A gating mechanism (the router) decides which experts handle which inputs. The result is a model with a huge total parameter count whose per-token compute resembles that of a much smaller model, because most parameters aren't activated for any given token.
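The routing idea can be illustrated with a minimal sketch in plain Python. It assumes top-k gating with a softmax over the selected experts' scores, which is one common design and not necessarily what any particular model uses. The names `route_token`, `moe_layer`, and the toy experts are hypothetical and exist only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, gate_weights, k=2):
    """Return (expert_index, mixing_weight) pairs for the top-k experts.

    gate_weights holds one gate vector per expert (an assumed, toy router).
    """
    # One gating logit per expert: dot product of the token with that gate vector.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the k highest-scoring experts; the rest are never evaluated.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalize the selected scores so the mixing weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, gate_weights, experts, k=2):
    """Run only the selected experts and return their weighted sum."""
    routes = route_token(token, gate_weights, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


# Toy experts: each one just scales its input by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

output, used = moe_layer([1.0, 0.0], gate_weights, experts, k=2)
print(used)  # indices of the two experts that actually ran
```

With four experts and k=2, only half of the expert parameters are touched for this token. That is the sense in which most parameters are not activated.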
Llama 4 uses MoE as a core architectural choice: Maverick has 400B total parameters but only 17B active parameters per token, which is why its per-token compute is far lower than its total size implies, even though all 400B parameters must still be loaded into memory. Mixtral from Mistral AI was an earlier, widely used open MoE model. MoE designs are now common among frontier-scale open-weight models.
For builders, MoE matters primarily when evaluating open-weight models for self-hosting: a model's active parameter count drives per-token compute and speed, while its total parameter count drives the memory needed to host it. A 400B MoE model can be both more capable and cheaper per token than a 70B dense model, depending on the task, but it needs considerably more GPU memory to serve.
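A rough back-of-envelope comparison makes the trade-off concrete. The sketch below assumes 16-bit weights (2 bytes per parameter) and the common approximation of about 2 FLOPs per active parameter per generated token. It ignores the KV cache, activations, and routing overhead, so treat the numbers as order-of-magnitude estimates only. The function name `inference_estimate` is hypothetical.

```python
def inference_estimate(total_params, active_params, bytes_per_param=2):
    """Rough estimate of weight memory (GB) and forward-pass FLOPs per token.

    Assumptions: weights only, 16-bit precision by default, and roughly
    2 FLOPs per active parameter per token.
    """
    weight_memory_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2 * active_params
    return weight_memory_gb, flops_per_token


# 400B-total / 17B-active MoE versus a 70B dense model.
moe_mem, moe_flops = inference_estimate(400e9, 17e9)
dense_mem, dense_flops = inference_estimate(70e9, 70e9)

print(f"MoE:   {moe_mem:.0f} GB of weights, {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense: {dense_mem:.0f} GB of weights, {dense_flops / 1e9:.0f} GFLOPs per token")
```

Under these assumptions the MoE model needs roughly a quarter of the dense model's compute per token but several times its weight memory, so which one is cheaper to run depends on whether the deployment is limited by compute or by memory.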