Multimodal output
Also known as: native image output, any-to-any model, cross-modal generation
Early multimodal models could accept different input types but would always produce text as output. Multimodal output models close the loop: they can generate images, audio, video, or combinations of these natively, without routing to a separate specialized model. This is the shift from 'understand multiple modalities' to 'produce multiple modalities.'
GPT-4o's native image generation, Gemini's native audio understanding and output, and models like Sora and Veo that produce video from text prompts are all instances of this capability. The practical impact for builders is that workflows which used to chain a text model to a separate image or audio generation API can increasingly be handled by a single model call, simplifying architecture and often improving coherence between text and media outputs.
The direction of travel is toward 'any-to-any' models: input text, images, audio, or video, output any combination of those. This enables products like voice assistants that can also sketch diagrams, tutoring systems that respond with annotated visuals, or creative tools that generate matched audio and imagery together. Builders should track which multimodal output capabilities each model provider supports, since this is still an area of rapid differentiation.