MoE
Also known as: mixture of experts, mixture-of-experts, sparse model, MoE architecture
A standard (dense) LLM uses all of its parameters for every token it processes. MoE models instead route each token (a word or word fragment) through a small subset of the available 'expert' networks, chosen by a gating mechanism, also called a router. The total parameter count looks huge on paper, but only a fraction is active for any given token, so the model needs less compute per token than its size implies. The full set of experts typically still has to be held in memory, so the saving is in compute rather than in memory footprint.
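The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the eight "experts" are stand-in functions, the gate scores are hard-coded rather than learned, and the choice of top-2 routing with a softmax over only the selected scores is one common variant assumed here for the example. All names (route_token, moe_layer, gate_scores) are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k=2):
    """Pick the top_k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs. The weights are a
    softmax over the selected scores only (an assumption of this sketch).
    """
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token_value, gate_scores, experts, top_k=2):
    """Run only the chosen experts and blend their outputs by gate weight."""
    routes = route_token(gate_scores, top_k)
    return sum(w * experts[i](token_value) for i, w in routes)

# Eight toy experts: expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]

# Hypothetical gate scores for one token; a real router computes these
# from the token's hidden state with learned weights.
gate_scores = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 0.2]

routes = route_token(gate_scores, top_k=2)
output = moe_layer(3.0, gate_scores, experts, top_k=2)

print("chosen experts:", [i for i, _ in routes])  # experts 1 and 4
print("layer output:", output)
```

Only two of the eight experts run for this token; the other six are skipped entirely, which is where the compute saving comes from.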
Mixtral, from Mistral AI, was one of the first open models to popularize MoE at scale, and GPT-4 was widely reported to use an MoE architecture. The design matters for builders for two reasons: inference (generating answers) is cheaper per query relative to the total model size, and specialized knowledge can be distributed across different experts, potentially improving quality on diverse tasks.
You'll see MoE mentioned when people compare the 'active parameters' of a model to its 'total parameters.' A model listed with 400B total parameters and 40B active parameters per token is an MoE model. It's a useful concept for reading model specs and understanding why two models with the same total parameter count can have very different speed and cost profiles.
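The arithmetic behind that comparison is simple enough to check by hand. The sketch below uses the 400B/40B figures from the example above; the 2 bytes per parameter (16-bit weights) is an assumption added here for illustration, and active_fraction is a hypothetical helper name.

```python
def active_fraction(total_params, active_params):
    """Share of a model's parameters that are used for each token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params

total_params = 400_000_000_000   # 400B total, from the example above
active_params = 40_000_000_000   # 40B active per token

frac = active_fraction(total_params, active_params)

# Assumption: weights stored at 2 bytes per parameter (16-bit).
bytes_per_param = 2
weight_gb = total_params * bytes_per_param / 1e9

print(f"active share per token: {frac:.0%}")
print(f"weights to store (all experts): {weight_gb:.0f} GB")
```

Under these assumptions, about a tenth of the parameters do work on each token, while the storage requirement is still set by the total count.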