Overview
Grounded-SAM is an open-source pipeline from IDEA-Research that chains two vision models together: Grounding DINO finds objects in an image that match a text prompt, and Segment Anything (SAM) turns each detection into a precise pixel mask. The result is open-vocabulary segmentation — you describe what you want in plain words like "bear" or "the person on the left" and get back boxes and masks for it.
It is aimed at computer vision researchers and engineers who need to label or segment objects without training a model for each new class. Because it works zero-shot from text, it suits open-world tasks where the list of categories is not fixed in advance.
Within the multimodal vision space, Grounded-SAM acts as a building block rather than a single model. Each part can be used on its own or swapped for a similar model (for example replacing Grounding DINO with another detector), and the project ships extra demos that pair it with tools like RAM for automatic labeling and Stable Diffusion for inpainting.
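The two-stage design can be pictured as two interchangeable callables: one that maps text to boxes and one that maps a box to a mask. The sketch below is illustrative only. `Detector`, `Segmenter`, `ground_and_segment`, and the toy stand-ins are hypothetical names, not part of the Grounded-SAM codebase, and the stand-ins return fixed values so the data flow runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
Mask = List[List[bool]]           # row-major binary mask
Image = List[List[int]]           # toy grayscale image

# Hypothetical interfaces: any text-to-box model and any box-to-mask model fit.
Detector = Callable[[Image, str], List[Box]]
Segmenter = Callable[[Image, Box], Mask]


@dataclass
class Result:
    box: Box
    mask: Mask


def ground_and_segment(image: Image, prompt: str,
                       detect: Detector, segment: Segmenter) -> List[Result]:
    """Stage 1 proposes boxes for the prompt; stage 2 turns each box into a mask."""
    return [Result(box, segment(image, box)) for box in detect(image, prompt)]


# Toy stand-ins (NOT real models) so the sketch runs without weights.
def toy_detector(image: Image, prompt: str) -> List[Box]:
    # Pretend the prompt matched one object in the top-left 2x2 corner.
    return [(0, 0, 2, 2)] if prompt else []


def toy_segmenter(image: Image, box: Box) -> Mask:
    # Mark every pixel inside the box as foreground.
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(len(row))]
            for y, row in enumerate(image)]


image = [[0] * 4 for _ in range(3)]
results = ground_and_segment(image, "bear", toy_detector, toy_segmenter)
```

Because the pipeline only depends on the two call signatures, swapping in a different detector or segmenter means passing a different function, which mirrors the modularity described above.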
What it does
- Open-vocabulary detection and segmentation driven by free-text prompts, with no per-class training
- Combines Grounding DINO (text-to-box) with Segment Anything (box-to-mask) in one pipeline
- Modular design — each model can run standalone or be replaced with a similar one
- RAM-Grounded-SAM demo for automatic image labeling and annotation pipelines
- Efficient SAM variants such as FastSAM and MobileSAM for faster inference, plus HQ-SAM support for higher-quality masks
- Also available through Hugging Face Transformers for Grounding DINO and Grounded SAM
Getting started
Grounded-SAM runs locally with Python 3.8+, PyTorch 1.7+, and torchvision 0.8+. Clone the repo, install the two bundled model packages and the extra dependencies, download the pretrained weights, then run the demo with a text prompt.
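A quick way to compare a local environment against those minimums is sketched below. It is not part of the project; `check_environment` and the other helper names are made up for illustration, and the check reads installed package metadata without importing PyTorch.

```python
import re
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions quoted in the text above.
MINIMUMS: Dict[str, Tuple[int, int]] = {"torch": (1, 7), "torchvision": (0, 8)}


def major_minor(version: str) -> Tuple[int, int]:
    """Turn '2.1.0+cu118' into (2, 1); suffixes and build tags are ignored."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> Dict[str, str]:
    """Report 'ok', 'too old', or 'missing' for Python and each package."""
    report = {"python": "ok" if sys.version_info >= (3, 8) else "too old"}
    for package, minimum in MINIMUMS.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        else:
            report[package] = "ok" if major_minor(found) >= minimum else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```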
Clone the repository
Get the source, which includes the Segment Anything and GroundingDINO subfolders.
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything
cd Grounded-Segment-Anything
Install the models and dependencies
Install Segment Anything and Grounding DINO as editable packages, then the extra Python dependencies.
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
Download the pretrained weights
Fetch the SAM ViT-H checkpoint and the Grounding DINO SwinT checkpoint into the repo folder.
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
Run the demo with a text prompt
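Before launching the demo, it can help to confirm that both checkpoint files actually landed in the repo folder, since an interrupted download may leave a missing or empty file. The helper below is a small sketch, not a project utility. `missing_checkpoints` is a hypothetical name, and the file names and URLs are copied from the download commands.

```python
from pathlib import Path
from typing import Dict, List

# File names and URLs copied from the download commands.
CHECKPOINTS: Dict[str, str] = {
    "sam_vit_h_4b8939.pth":
        "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
    "groundingdino_swint_ogc.pth":
        "https://github.com/IDEA-Research/GroundingDINO/releases/download/"
        "v0.1.0-alpha/groundingdino_swint_ogc.pth",
}


def missing_checkpoints(repo_dir: str = ".") -> List[str]:
    """Return the names of expected checkpoint files that are absent or empty."""
    root = Path(repo_dir)
    missing = []
    for name in CHECKPOINTS:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_checkpoints():
        print(f"missing {name}; download it from {CHECKPOINTS[name]}")
```

This only checks presence and non-zero size; it does not verify file integrity.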
Point the demo at an image and a prompt; it writes the detection and segmentation results to the output directory.
python grounded_sam_demo.py \
--config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
--grounded_checkpoint groundingdino_swint_ogc.pth \
--sam_checkpoint sam_vit_h_4b8939.pth \
--input_image assets/demo1.jpg \
--output_dir "outputs" \
--text_prompt "bear"
Commands and code are distilled from the project's own documentation; always check the official repo for the latest instructions.
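The demo takes its prompt as a single string. When several classes are wanted at once, a common convention with Grounding DINO is to write them in lowercase and separate them with periods. The helper below is a hypothetical convenience, not part of the repo, that builds such a string from a list of class names; treat the exact separator format as an assumption to confirm against the Grounding DINO documentation.

```python
from typing import Iterable


def build_text_prompt(class_names: Iterable[str]) -> str:
    """Join class names into one period-separated, lowercase prompt string."""
    cleaned = []
    for name in class_names:
        # Normalise case and spacing, and strip stray periods at the edges.
        name = " ".join(name.lower().split()).strip(". ")
        # Drop blanks and duplicates while keeping first-seen order.
        if name and name not in cleaned:
            cleaned.append(name)
    return " . ".join(cleaned) + " ." if cleaned else ""


prompt = build_text_prompt(["Bear", "  tree ", "bear", "Hiking  Backpack."])
print(prompt)  # bear . tree . hiking backpack .
```

The resulting string can then be passed as the value of `--text_prompt` in the demo command.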
When to use it
- Auto-labeling image datasets by describing target classes in text instead of drawing masks by hand
- Zero-shot segmentation of objects that were never in a fixed training label set
- Building annotation pipelines, for example pairing RAM-Grounded-SAM to tag and segment images automatically
- Prototyping open-world vision workflows that combine detection, segmentation, and downstream editing or inpainting
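For the auto-labeling use case, each predicted mask usually has to be turned into an annotation record. The sketch below is a minimal illustration under stated assumptions: `mask_to_annotation` is a hypothetical helper, the toy grid stands in for a real model mask, and the `[x, y, width, height]` box layout follows the COCO convention. A real pipeline would more likely use NumPy and pycocotools for this step.

```python
from typing import Dict, List, Sequence


def mask_to_annotation(mask: Sequence[Sequence[object]], label: str) -> Dict[str, object]:
    """Summarise a 2D binary mask as a label, an [x, y, w, h] box, and a pixel area."""
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:  # any truthy pixel counts as foreground
                xs.append(x)
                ys.append(y)
    if not xs:
        return {"label": label, "bbox": None, "area": 0}
    x0, y0 = min(xs), min(ys)
    # Width and height include the last foreground pixel.
    bbox = [x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1]
    return {"label": label, "bbox": bbox, "area": len(xs)}


toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_annotation(toy_mask, "bear"))
# {'label': 'bear', 'bbox': [1, 1, 3, 2], 'area': 5}
```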
How Grounded-SAM compares
The table below places Grounded-SAM alongside other open-source vision and understanding tools tracked by AI/TLDR, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing. |
| Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking. |
| Supervision | ★ 44.7k | A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models. |
| MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production. |
| Segment Anything 2 (SAM 2) | ★ 19.4k | Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes. |
| Grounded-SAM | ★ 17.6k | A pipeline that detects and segments anything in an image from a text prompt. |
| DINOv3 | ★ 10.7k | Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks. |
| Segment Anything 3 (SAM 3) | ★ 10.6k | Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts. |