Multimodal model
Also known as: multimodal, multimodal ai, multi-modal model, multimodal llm
Early LLMs worked on text in, text out. Multimodal models remove that constraint. They can receive an image and describe it, take a screenshot and explain the UI, process audio and transcribe it, analyze a chart and summarize the data, or combine all of these in a single request. Models like GPT-5, Claude Opus 4, Gemini 3, and Qwen3-Omni are all multimodal.
For builders, multimodal capability unlocks product categories that text-only models could not address: visual design feedback ('here is a screenshot, what is wrong with this UI'), document analysis ('here is a PDF, extract the key terms'), voice interfaces, and computer use (seeing a screen and deciding what to click). The computer use capability described elsewhere in this glossary is only possible because models can see screenshots.
In practice, builders in 2026 rarely choose a model specifically for multimodal capability because most frontier models have it. The more important question is which modalities they need at what quality level. Vision and text are well-covered across frontier models. Audio-to-text is mature. Video understanding and video generation remain areas where quality varies significantly between models and providers.