Text-to-image
Also known as: text to image generation, T2I
Text-to-image is the foundational task of modern AI image generation. A model is trained on enormous datasets of image-text pairs, learning to map language descriptions onto visual content. At generation time, you provide a prompt and the model produces an image that matches it. In diffusion-based models, the dominant approach, generation starts from random noise and progressively refines it into something coherent.
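The refinement loop can be illustrated with a toy sketch. This is not how any particular model is implemented. The names toy_denoiser and generate are hypothetical, the "image" is a short list of numbers, and the denoiser cheats by returning a known target, where a real system would use a trained network conditioned on the prompt. Only the shape of the loop carries over: begin with noise, then repeatedly nudge the image toward the model's current prediction.

```python
import random


def toy_denoiser(noisy_image, target):
    # Hypothetical stand-in for a trained, prompt-conditioned network.
    # A real model would predict the clean image from the noisy input
    # and the prompt; this toy simply returns the known target.
    return list(target)


def generate(target, steps=20, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    image = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted = toy_denoiser(image, target)
        # Move a growing fraction of the way toward the prediction,
        # so early steps make coarse changes and the last step lands on it.
        blend = 1.0 / (steps - step)
        image = [x + blend * (p - x) for x, p in zip(image, predicted)]
    return image


if __name__ == "__main__":
    # The list below plays the role of "the image the prompt describes".
    print(generate([0.0, 0.5, 1.0, 0.25]))
```

More steps give a smoother path from noise to the final image. In real systems the step count trades generation time against output quality.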
Output quality on this task has improved sharply, and the gap between leading models has narrowed. In 2022, most text-to-image outputs were obviously artificial and frequently mangled human anatomy. By 2025, the best outputs are photorealistic enough to fool casual observers and consistent enough for commercial production work. Models still differ along several dimensions: prompt adherence (how literally the model follows your words), text rendering inside images (historically terrible, now largely reliable in top models), consistency across multiple generations, and each model's aesthetic biases.
Text-to-image is now a commodity capability embedded across dozens of tools. What differentiates products built on it is the workflow around it: fine-tuning for brand style, editing capabilities, integration with other creative tools, commercial licensing terms, and the specific visual aesthetic the underlying model defaults to.