AI/TLDR

What Are Multimodal Embeddings? Searching Images with Text

Understand how models like CLIP map text and images into one shared vector space, so a typed query can retrieve a matching picture.

Intermediate · 10 min read · Updated 2026-06-13

In plain English

A regular text embedding turns a piece of text into a list of numbers that captures its meaning. Sentences about the same idea land close together; unrelated ones land far apart. That works beautifully — as long as everything you compare is text.

[Figure: Multimodal Embeddings illustration]

But what if you want to type the words "a golden retriever on a beach" and have your app find the matching photo, even though the photo has no caption? Text lives in words; the photo lives in pixels. They speak different languages, so their vectors normally have nothing to do with each other.

Multimodal embeddings fix this by putting text and images (and sometimes audio or video) into one shared coordinate space. A model like CLIP learns to place the words "a dog on a beach" and an actual photo of a dog on a beach at nearly the same point in that space. Once everything lives in the same map, comparing a text query to an image is just measuring the distance between two vectors — exactly like ordinary semantic search, except now the two sides can be different kinds of media.

Think of a shared map of a city where every photo of a place and every written description of that place are pinned to the exact same spot. "The red bridge at sunset" and a snapshot of that bridge both land on the same pin. To answer "show me the red bridge," you don't translate anything — you just look up the pin for the words and grab whatever else is parked there, photos included.

Why it matters

Most of the world's useful data is not clean text. It's product photos, scanned documents, screenshots, X-rays, security footage, memes, and audio clips. A text-only embedding model can't search any of it directly — you'd first have to caption every image by hand, which doesn't scale. Multimodal embeddings let you search media by its content, with no manual labels.

Here are the everyday problems a shared text-image space solves:

  • Search photos by description. Type "invoice with a red overdue stamp" and find the right scan, even if no one tagged it. This powers visual search in e-commerce, photo libraries, and digital-asset management.
  • Zero-shot image classification. Want to know if an image is a cat, a dog, or a car? Embed the image once, embed the words "a cat," "a dog," "a car," and pick whichever label sits closest. No training a new classifier — you just write the category names as text.
  • Deduplication and clustering. Because near-identical images sit at nearly the same point, you can find duplicate or visually similar photos cheaply, then cluster a messy library into groups.
  • Retrieval for multimodal RAG. Multimodal retrieval lets an assistant pull the relevant chart or diagram into its context, not just text. This is the foundation of multimodal retrieval-augmented generation.
  • Cross-modal recommendation. "More images like this one" and "images that match this song's mood" both become simple nearest-neighbor lookups once the modalities share a space.
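The zero-shot classification item in the list above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the 3-dimensional vectors are made-up stand-ins for embeddings a CLIP-style model would produce, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_vec: np.ndarray, label_vecs: np.ndarray, labels: list[str]):
    """Return the label whose text embedding sits closest to the image embedding."""
    scores = normalize(label_vecs) @ normalize(image_vec)
    return labels[int(np.argmax(scores))], scores

# ASSUMPTION: toy vectors invented for illustration; a real system would
# obtain them from the text and image encoders of one multimodal model.
labels = ["a cat", "a dog", "a car"]
label_vecs = np.array([
    [1.0, 0.1, 0.0],   # pretend embedding of "a cat"
    [0.1, 1.0, 0.0],   # pretend embedding of "a dog"
    [0.0, 0.0, 1.0],   # pretend embedding of "a car"
])
image_vec = np.array([0.2, 0.9, 0.1])  # pretend embedding of a dog photo

predicted, scores = zero_shot_classify(image_vec, label_vecs, labels)
print(predicted, scores.round(3))
```

Adding a new category here means adding one more label string and its embedding; no classifier is retrained.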

The deeper reason a builder cares: it collapses what used to be many specialized models into one. Before CLIP-style models, image search, classification, and captioning each needed their own labeled dataset and trained network. A single multimodal embedding model handles all of them through the same trick — embed both sides, compare distances — and it generalizes to categories it was never explicitly trained on.

How it works

A multimodal embedding model has two encoders: a text encoder (a transformer that reads words) and an image encoder (a vision model that reads pixels). Each produces a vector. The whole job of training is to make those two encoders agree — to push the vector for a caption and the vector for its matching image toward the same place, while pushing mismatched pairs apart.

Contrastive training on image–caption pairs

The training data is hundreds of millions of (image, caption) pairs scraped from the web — a photo and the alt-text or caption that came with it. The model learns by a method called contrastive learning. In each batch it takes, say, 100 images and their 100 real captions, embeds all of them, and computes the similarity of every image to every caption — a 100×100 grid.

The training goal is simple: for each image, the correct caption (the diagonal of the grid) should score highest, and all 99 wrong captions should score low. The same holds in the other direction, from each caption to its image. The model is rewarded for pulling true pairs together and pushing false pairs apart. Repeat this over many batches and passes through the data, and the two encoders gradually learn a shared geometry where meaning lines up across modalities. (This is the same family of idea behind how embeddings are trained in general; only here the positive pairs cross modalities.)
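The grid-and-diagonal objective described above can be written as a small loss function. The sketch below is one plausible form, assuming a softmax cross-entropy over the similarity grid averaged over both directions (image to caption and caption to image), as CLIP-style training commonly does. The temperature of 0.07 and the random toy vectors are illustrative assumptions, and `contrastive_loss` is a hypothetical name.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_vecs: np.ndarray, txt_vecs: np.ndarray, temperature: float = 0.07) -> float:
    """Softmax contrastive loss over an N x N image-caption similarity grid."""
    img = img_vecs / np.linalg.norm(img_vecs, axis=1, keepdims=True)
    txt = txt_vecs / np.linalg.norm(txt_vecs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # row i = image i vs. every caption
    diag = np.arange(len(logits))                 # true pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # caption -> image
    return float((loss_i2t + loss_t2i) / 2)

# ASSUMPTION: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
captions = rng.normal(size=(8, 16))
aligned_images = captions + 0.05 * rng.normal(size=(8, 16))  # each image near its caption
shuffled_images = aligned_images[::-1]                        # every image paired with a wrong caption

aligned_loss = contrastive_loss(aligned_images, captions)
shuffled_loss = contrastive_loss(shuffled_images, captions)
print(f"aligned: {aligned_loss:.4f}   shuffled: {shuffled_loss:.4f}")
```

The aligned batch produces a much lower loss than the shuffled one, which is the signal gradient descent follows when it pulls true pairs together.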

Querying: it's just nearest-neighbor search

Once trained, using the model is the easy part. You embed every image in your collection once, up front, and store the vectors in a vector database. At query time you embed the user's text with the text encoder and ask the database for the nearest image vectors — typically by cosine similarity. The images closest to the query vector are your search results.

Notice the symmetry: because both sides land in the same space, the same index also supports image-to-image search ("more like this photo") and image-to-text search (find the caption nearest a picture). One shared space, many query directions.
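To make the symmetry concrete, here is a small sketch in which one search function serves all three query directions. The vectors are invented toy values rather than real model outputs, and `top_k` and `unit` are hypothetical helper names.

```python
import numpy as np

def unit(v) -> np.ndarray:
    """Convert to a float array and scale each vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 2):
    """Return (position, cosine score) for the k nearest unit-length vectors."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# ASSUMPTION: toy embeddings; in practice both sets come from the same model.
image_names = ["dog_beach.jpg", "city_night.jpg", "red_bridge.jpg"]
image_vecs = unit([[0.9, 0.1, 0.1], [0.1, 0.9, 0.2], [0.2, 0.3, 0.9]])
captions = ["a dog on a beach", "a city at night", "a red bridge at sunset"]
caption_vecs = unit([[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.2, 1.0]])

# Text -> image: a text query vector against the image index.
text_hits = top_k(unit([0.95, 0.05, 0.1]), image_vecs, k=1)

# Image -> image: the first hit is the query image itself, so look at the second.
img_hits = top_k(image_vecs[2], image_vecs, k=2)

# Image -> text: an image vector against the caption index.
cap_hits = top_k(image_vecs[1], caption_vecs, k=1)

print(image_names[text_hits[0][0]])
print(image_names[img_hits[1][0]])
print(captions[cap_hits[0][0]])
```

Nothing in `top_k` knows which modality a vector came from; that indifference is what the shared space buys you.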

A worked example in code

Here is the entire idea in a few lines using an open CLIP-style model. The shape is always the same: embed the images once, embed the text query, compare with cosine similarity, sort.

text_to_image_search.py (Python)
import numpy as np
from sentence_transformers import SentenceTransformer
from PIL import Image

# One model, two input types. CLIP-style models accept BOTH
# images and text and place them in the SAME vector space.
model = SentenceTransformer("clip-ViT-B-32")

# 1) INGEST: embed your image library once, up front.
image_paths = ["dog_beach.jpg", "city_night.jpg", "red_bridge.jpg"]
images = [Image.open(p) for p in image_paths]
img_vecs = model.encode(images, normalize_embeddings=True)

# 2) QUERY: embed the user's text with the SAME model.
query = "a golden retriever on a beach"
q_vec = model.encode(query, normalize_embeddings=True)

# 3) COMPARE: cosine similarity is just a dot product on
#    normalized vectors. Highest score = best match.
scores = img_vecs @ q_vec
best = int(np.argmax(scores))
print(f"Best match: {image_paths[best]}  (score {scores[best]:.3f})")

Common pitfalls

Multimodal embeddings are easy to demo and easy to misuse. Most mistakes come from forgetting that both sides of a comparison must be produced under the same conditions.

  • Mixing models between index and query. This is the #1 gotcha. You must embed the query and the stored items with the same multimodal model. A vector from CLIP and a vector from a different model are not comparable, even if they're the same length, because the dimensions don't mean the same thing.
  • Expecting text-text quality. A model trained mainly on short web captions is great at "is this a photo of a dog?" but weaker at long, nuanced text-to-text matching than a dedicated text embedder. If you need both strong text search and image search, you may run two models.
  • Fine-grained detail and OCR. CLIP-style models capture the gist of an image ("a street scene") far better than fine print. Counting objects, reading text inside an image, or distinguishing two similar bird species is where they struggle.
  • Forgetting to normalize. A plain dot product equals cosine similarity only when both vectors have unit length. If you skip normalization and rank by dot product, vector magnitude leaks into the score and your rankings get noisy (see cosine similarity vs. dot product).
  • Domain mismatch. A model trained on natural web photos may do poorly on medical scans, satellite imagery, or CAD drawings. The shared space only aligns concepts the model actually saw during training.
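The normalization pitfall in the list above is easy to reproduce. In this small sketch, with hand-picked 2-dimensional vectors chosen purely for illustration, a raw dot product on unnormalized vectors ranks the long, poorly aligned vector first, while cosine similarity ranks the well-aligned one first.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by both vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ASSUMPTION: toy vectors chosen to exaggerate the effect.
query = np.array([1.0, 0.0])
close_but_short = np.array([0.9, 0.1])   # points almost exactly where the query points
off_but_long = np.array([5.0, 5.0])      # 45 degrees away, but much longer

candidates = (close_but_short, off_but_long)
raw_dot = [float(query @ v) for v in candidates]
cos = [cosine(query, v) for v in candidates]

print("raw dot:", raw_dot)                    # the long vector wins on magnitude alone
print("cosine :", [round(c, 3) for c in cos]) # the well-aligned vector wins
```

Normalizing once at ingest time, as `normalize_embeddings=True` does in the earlier listing, lets you keep the cheap dot product at query time without this distortion.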

Text embeddings vs. multimodal embeddings

It helps to see exactly what changes when you move from a text-only embedder to a multimodal one. The mechanics of search are identical; what differs is what can go into the space.

| Aspect | Text embeddings | Multimodal embeddings |
| --- | --- | --- |
| What goes in | Text only | Text + images (sometimes audio/video) |
| Encoders | One text encoder | Separate encoder per modality, one shared space |
| Trained on | Text pairs / labeled text | Image–caption (and other) pairs, contrastively |
| Can answer | "Find docs like this sentence" | "Find images matching this sentence" |
| Classic example | Sentence-Transformers, OpenAI text-embedding | CLIP, SigLIP, and successors |
| Best at | Nuanced long-text matching | Cross-modal gist matching |

The continuity is the point: if you already understand how text embeddings and semantic search work, you already understand 90% of multimodal embeddings. The only new concept is the shared space spanning more than one kind of input.

Going deeper

The plain CLIP recipe — two encoders, contrastive loss, image–caption pairs — is the foundation, and the field has refined every part of it.

Better losses and bigger data. CLIP used a softmax-style contrastive loss, which normalizes each image's scores across every caption in a large batch. Later models like SigLIP swapped in a sigmoid loss that scores each image-caption pair independently, so it needs no batch-wide normalization, scales to enormous batches more cheaply, and often improves accuracy. The headline trend has simply been more data, larger encoders, and cleaner caption filtering.
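For contrast with the softmax loss, a pairwise sigmoid loss in the spirit of SigLIP can be sketched as below. Each cell of the grid becomes its own yes/no question ("is this a true pair?"), so no row-wise or column-wise normalization is needed. The `scale` and `bias` values are illustrative assumptions (real training learns them), the toy vectors are random stand-ins for encoder outputs, and `sigmoid_pair_loss` is a hypothetical name.

```python
import numpy as np

def sigmoid_pair_loss(img_vecs: np.ndarray, txt_vecs: np.ndarray,
                      scale: float = 10.0, bias: float = -10.0) -> float:
    """Pairwise sigmoid loss: every image-caption cell is scored independently."""
    img = img_vecs / np.linalg.norm(img_vecs, axis=1, keepdims=True)
    txt = txt_vecs / np.linalg.norm(txt_vecs, axis=1, keepdims=True)
    logits = scale * (img @ txt.T) + bias
    n = len(logits)
    signs = 2.0 * np.eye(n) - 1.0      # +1 for true pairs (diagonal), -1 for all others
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x), computed stably
    return float(np.logaddexp(0.0, -signs * logits).sum() / n)

# ASSUMPTION: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
captions = rng.normal(size=(8, 16))
aligned_images = captions + 0.05 * rng.normal(size=(8, 16))  # each image near its caption
shuffled_images = aligned_images[::-1]                        # every image paired with a wrong caption

aligned_sig = sigmoid_pair_loss(aligned_images, captions)
shuffled_sig = sigmoid_pair_loss(shuffled_images, captions)
print(f"aligned: {aligned_sig:.4f}   shuffled: {shuffled_sig:.4f}")
```

Because each cell contributes its own term, the sum can be split across devices without first gathering a full row of scores, which is one reason this form is cheaper at very large batch sizes.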

Beyond two modalities. The same alignment trick extends past text and image. Models have aligned audio, video, depth, and more into a single space, so you can retrieve a video from a sound or an image from a typed phrase — all with one nearest-neighbor lookup. The recipe generalizes: pick paired data, train contrastively, share the space.

Contrastive embeddings vs. generative VLMs. Don't confuse a CLIP-style embedding model with a generative vision-language model that writes a paragraph about an image. The embedding model gives you a single vector for fast search and classification; the generative model produces free text and reasons about the image, but is far slower and does not yield a fixed vector you can store in an index. They're complementary: many systems use the embedding model to retrieve candidate images, then a generative model to describe or reason over them.

Production concerns. At scale you'll index millions of image vectors, so the same infrastructure questions as text search apply: approximate nearest-neighbor search and indexes like HNSW to keep lookups fast, plus picking an embedding model that matches your domain and latency budget. The vectors happen to come from images, but everything downstream is ordinary vector search.

The durable lesson: multimodal search isn't a separate machine bolted onto text search. It's the same embedding-and-distance machine, with a model trained so that two different kinds of input land in one shared coordinate space. Get that single idea and the rest is engineering.

FAQ

What are multimodal embeddings?

Multimodal embeddings are vectors that place different kinds of input — most commonly text and images — into a single shared coordinate space. Because a caption and its matching image land at nearly the same point, you can compare them by distance, which is what lets a text query retrieve a matching picture.

How does CLIP let you search images with text?

CLIP has two encoders, one for text and one for images, trained on hundreds of millions of image–caption pairs so that matching pairs land close together. You embed your images once, up front, with the image encoder. At query time you embed your text with the text encoder and return the images whose vectors are nearest the query vector by cosine similarity.

Do text and image vectors really live in the same space?

Yes — that is the entire design goal. Contrastive training pushes a true (image, caption) pair toward the same location while pushing mismatched pairs apart. After training, a sentence and its matching photo sit close together, so distance in that one shared space measures cross-modal relevance directly.

Can I use one model for images and a different one for text?

No. You must embed both sides of a comparison with the same multimodal model. Vectors from two different models are not comparable even if they have the same length, because their dimensions encode different things. Mixing models gives meaningless similarity scores.

What is contrastive learning in this context?

It is the training method behind CLIP-style models. The model embeds a batch of images and captions, scores every image against every caption, and is rewarded for ranking the correct caption highest while ranking all the wrong ones low. Repeated over hundreds of millions of pairs, this teaches the two encoders to share one aligned space.

What are multimodal embeddings bad at?

They capture the overall gist of an image well but struggle with fine detail: counting objects, reading small text inside an image, and telling apart very similar categories. They also inherit biases from their web-scraped captions and may perform poorly on specialized domains, like medical or satellite imagery, that differ from typical web photos.
