AI Glossary

Multimodal Embedding

Shared vector space for text, images, and other data types

Definition

Multimodal embeddings map different types of data (text, images, audio) into a shared vector space so that semantically related content from different modalities lands close together, typically as measured by cosine similarity. This enables cross-modal search (finding images from text descriptions) and supports downstream tasks such as image captioning and visual question answering. CLIP (by OpenAI), which embeds text and images, is a widely known multimodal embedding model.
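To make cross-modal search concrete, here is a minimal sketch in Python. It assumes the embeddings already exist: the three-dimensional vectors and file names below are made up for illustration and do not come from CLIP or any real model, whose embeddings have hundreds of dimensions. The helper names `cosine_similarity` and `search` are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, items):
    """Return item names sorted from most to least similar to the query."""
    return sorted(
        items,
        key=lambda name: cosine_similarity(query_vec, items[name]),
        reverse=True,
    )


# Toy image embeddings (hand-written, NOT model output).
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "cat_photo.jpg": [0.1, 0.9, 0.0],
    "car_photo.jpg": [0.0, 0.1, 0.9],
}

# Toy embedding standing in for the text query "a photo of a dog".
text_query_embedding = [0.8, 0.2, 0.1]

ranking = search(text_query_embedding, image_embeddings)
print(ranking)
```

Because the text and image vectors live in the same space, one similarity function can compare them directly; in a real system the vectors would come from the text and image encoders of a multimodal model.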
