Multimodal Input
Also known as: vision input, image input, file input, multimodal prompting
Early language models only processed text. Modern frontier models accept a range of input types: images (screenshots, photos, diagrams), PDFs and documents, audio (for transcription or response), and in some cases video. When you drop a screenshot into Claude or upload a PDF to GPT-5, that's multimodal input in use.
The practical applications are broad: read a chart and answer questions about it, extract text from a scanned document, analyze a UI screenshot and suggest code changes, transcribe an audio recording, or describe what's in an image. The model sees all of this as context alongside whatever text you include.
For builders, multimodal input opens product categories that were previously impossible: AI that can read a receipt, analyze a diagram, review a design mockup, or process a video frame as part of a workflow. The key question when evaluating models is not just whether they accept a format but how well they reason over it. Vision quality varies significantly between models, especially for charts, handwriting, and complex layouts.