Concept·AI Models & Capabilities·Added 2 months ago

Multimodal Input

Also known as: vision input, image input, file input, multimodal prompting

The ability to send images, PDFs, audio, video, or other non-text content to a model alongside or instead of text. Most frontier models now accept multiple input types; what you can do with them varies significantly by model and provider.

Early language models only processed text. Modern frontier models accept a range of input types: images (screenshots, photos, diagrams), PDFs and documents, audio (for transcription or response), and in some cases video. When you drop a screenshot into Claude or upload a PDF to GPT-5, that's multimodal input in use.

The practical applications are broad: read a chart and answer questions about it, extract text from a scanned document, analyze a UI screenshot and suggest code changes, transcribe an audio recording, or describe what's in an image. The model sees all of this as context alongside whatever text you include.

For builders, multimodal input opens product categories that were previously impossible: AI that can read a receipt, analyze a diagram, review a design mockup, or process a video frame as part of a workflow. The key question when evaluating models is not just whether they accept a format but how well they reason over it. Vision quality varies significantly between models, especially for charts, handwriting, and complex layouts.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.

Related terms