AI Glossary
Multimodal AI
AI that understands text, images, and more at once
Definition
Multimodal AI refers to models that can process and generate multiple types of data — text, images, audio, video — within the same model. GPT-4o and Gemini 1.5 are examples: you can show them an image and ask questions about it. Multimodal models are enabling new applications that were impossible when AI was text-only.