AI Glossary
Multimodal Embedding
Shared vector space for text, images, and other data types
Definition
Multimodal embeddings map different types of data, such as text, images, and audio, into a shared vector space so that semantically related content has similar vectors regardless of modality. This directly enables cross-modal search (for example, retrieving images from a text description) and provides a foundation for tasks such as image captioning and visual question answering. CLIP (by OpenAI) is a widely known multimodal embedding model that covers text and images.
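Example
A minimal sketch of cross-modal search in a shared vector space, to make the idea concrete. The 3-dimensional vectors, the file names, and the helper `rank_images` are hypothetical stand-ins written by hand for illustration; a real system would get its vectors from a trained text encoder and image encoder (such as CLIP) and would use hundreds of dimensions. Cosine similarity is assumed here as the closeness measure, which is a common choice but not the only one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical image embeddings: hand-written toy vectors standing in for
# the output of an image encoder. Not produced by any real model.
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "beach_photo.jpg": [0.1, 0.9, 0.2],
    "city_photo.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical text embedding for a query like "a puppy playing",
# standing in for the output of a text encoder in the same space.
text_query_embedding = [0.8, 0.2, 0.1]


def rank_images(query_vec, images):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in images.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_images(text_query_embedding, image_embeddings)
best_match = ranking[0][0]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the text query and the images live in the same space, the search is just a nearest-neighbor lookup: the text vector is compared against every image vector and the closest one wins. With these toy vectors the dog photo ranks first, since its vector points in nearly the same direction as the query.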