Overview
EmbeddingGemma is a 308M-parameter text embedding model from Google DeepMind, based on Gemma 3. It turns text into dense vectors that you can use for semantic search, retrieval-augmented generation (RAG), clustering, and similarity comparison. It was trained on data covering over 100 languages, so it works across multilingual content.
Because it is small, it is meant to run on-device or on modest hardware (CPU, GPU, or TPU) rather than requiring a large server. It also supports Matryoshka Representation Learning, which lets you shorten the output vector (for example from 768 down to 128 dimensions) to trade a little accuracy for lower storage and faster search.
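To make the truncation idea concrete, here is a minimal sketch of shortening a vector by keeping its leading components and rescaling it to unit length. The toy 8-value vector stands in for a real 768-dimensional embedding, and the helper name `truncate_and_normalize` is hypothetical, not part of any library. Re-normalizing after truncation is a common convention assumed here, not something this page specifies.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

# Toy stand-in for a full-size embedding (illustrative values only).
full = [0.5, -0.5, 0.4, 0.3, 0.1, -0.1, 0.05, 0.02]
short = truncate_and_normalize(full, 4)
print(len(short))
```

The shorter vector costs less to store and compare, at the price of discarding whatever information the dropped components carried.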
It fits the embedding-models category: you generate embeddings once for your documents, store them in a vector index, and compare them against query embeddings at search time. The recommended way to run it is through the sentence-transformers library using the open weights published on Hugging Face.
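The embed-once, search-later workflow can be sketched with a brute-force index held in a dictionary. The document IDs and three-dimensional vectors below are made up for illustration; in practice the vectors would come from the model and a real vector index would likely replace the linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical precomputed document embeddings (computed once, then stored).
index = {
    "doc-fruit": [0.9, 0.1, 0.0],
    "doc-cars": [0.0, 0.2, 0.9],
    "doc-recipes": [0.7, 0.6, 0.1],
}

# Hypothetical query embedding, produced at search time.
query = [0.8, 0.3, 0.0]
results = top_k(query, index, k=2)
print([doc_id for doc_id, _ in results])
```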
What it does
- 308M-parameter embedding model based on Gemma 3, suited to on-device and CPU/GPU/TPU use
- Trained on data spanning over 100 languages for multilingual text
- Matryoshka Representation Learning: truncate output dimensions from 768 down to 128 to save storage
- Open weights licensed for commercial use, published on Hugging Face as google/embeddinggemma-300M
- Runs through the standard sentence-transformers API with model.encode() and model.similarity()
- Built for RAG, semantic search, clustering, and similarity tasks
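The storage saving from truncating 768 dimensions to 128 can be estimated with simple arithmetic. This sketch assumes each value is stored as a 4-byte float and uses a hypothetical corpus of one million vectors; real indexes add overhead that is not counted here.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of the stored vectors, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value

num_vectors = 1_000_000  # hypothetical corpus size
full_size = index_size_bytes(num_vectors, 768)
small_size = index_size_bytes(num_vectors, 128)
ratio = full_size / small_size

print(f"768 dims: {full_size / 1e9:.2f} GB")
print(f"128 dims: {small_size / 1e9:.2f} GB")
print(f"reduction: {ratio:.0f}x")
```

Under these assumptions the raw vectors shrink by the same factor as the dimension count.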
Getting started
EmbeddingGemma is loaded through the sentence-transformers library. Install the dependencies, then load the model from Hugging Face and call encode().
Install sentence-transformers
Install sentence-transformers along with the transformers build that supports EmbeddingGemma.
pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
Generate embeddings
Load the model by its Hugging Face ID and call encode() on your text. Use similarity() to compare two embeddings.
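import torch
from sentence_transformers import SentenceTransformer
# Use a GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download (or load from cache) the open weights from Hugging Face.
model = SentenceTransformer("google/embeddinggemma-300M").to(device=device)
texts = ["apple", "banana", "car"]
# One embedding vector per input text.
embeddings = model.encode(texts)
# Similarity score between "apple" and "banana".
similarities = model.similarity(embeddings[0], embeddings[1])
Shrink the embedding dimension (optional)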
Pass truncate_dim to get shorter vectors via Matryoshka Representation Learning, which reduces storage and speeds up search.
embeddings = model.encode(texts, truncate_dim=512)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Build a RAG pipeline: embed your documents, store them in a vector index, and retrieve the closest chunks for a user query
- Add semantic search to an app so results match meaning rather than exact keywords
- Run embeddings on-device or on a CPU where a larger model would be too heavy
- Embed multilingual content for cross-language search, clustering, or deduplication
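As one illustration of the deduplication use case above, near-duplicates can be filtered by comparing each embedding against the ones already kept. This is a greedy sketch with made-up vectors and an arbitrary 0.95 threshold; the function name `deduplicate` and the threshold are assumptions to tune for your own data, not recommendations from the project.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def deduplicate(items, threshold=0.95):
    """Keep an item only if it is not too similar to any item already kept."""
    kept = []
    for text, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

# Hypothetical (text, embedding) pairs; the first two are meant to be
# translations of each other and therefore close in vector space.
items = [
    ("Hello world", [1.0, 0.0, 0.1]),
    ("Hola mundo", [0.99, 0.02, 0.1]),
    ("Car engine repair", [0.0, 1.0, 0.2]),
]
unique = deduplicate(items)
print(unique)
```

The pairwise scan is quadratic in the number of items, so a larger corpus would typically use a vector index to find candidates first.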
How EmbeddingGemma (Gemma) compares
How EmbeddingGemma (Gemma) sits alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | A 308M multilingual embedding model small enough to run on-device. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |