Grounded-SAM

Detect and segment anything in an image from a text prompt

github.com/IDEA-Research/Grounded-Segment-Anything ★ 17.6k

Overview

Grounded-SAM is an open-source pipeline from IDEA-Research that chains two vision models together: Grounding DINO finds objects in an image that match a text prompt, and Segment Anything (SAM) turns each detection into a precise pixel mask. The result is open-vocabulary segmentation — you describe what you want in plain words like "bear" or "the person on the left" and get back boxes and masks for it.

It is aimed at computer vision researchers and engineers who need to label or segment objects without training a model for each new class. Because it works zero-shot from text, it suits open-world tasks where the list of categories is not fixed in advance.

Within the multimodal vision space, Grounded-SAM acts as a building block rather than a single model. Each part can be used on its own or swapped for a similar model (for example replacing Grounding DINO with another detector), and the project ships extra demos that pair it with tools like RAM for automatic labeling and Stable Diffusion for inpainting.

What it does

Open-vocabulary detection and segmentation driven by free-text prompts, no per-class training
Combines Grounding DINO (text-to-box) with Segment Anything (box-to-mask) in one pipeline
Modular design — each model can run standalone or be replaced with a similar one
RAM-Grounded-SAM demo for automatic image labeling and annotation pipelines
Alternative mask backends in place of the default SAM, such as FastSAM and MobileSAM for speed or HQ-SAM for higher-quality masks
Grounding DINO and Grounded-SAM can also be run through Hugging Face Transformers

Getting started

Grounded-SAM runs locally and requires Python 3.8+, PyTorch 1.7+, and torchvision 0.8+. Clone the repo, install the two bundled packages (Segment Anything and Grounding DINO) along with the extra dependencies, download the pretrained weights, then run the demo with a text prompt.
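
Before installing, it can help to confirm that the environment meets those minimum versions. The sketch below is one way to do that, not part of the project: `version_ge` and `check_min` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether an installed version satisfies a documented minimum.
# Arguments: name, installed version, minimum version.
check_min() {
  name=$1
  have=$2
  want=$3
  if version_ge "$have" "$want"; then
    echo "$name $have: ok (needs >= $want)"
  else
    echo "$name $have: too old (needs >= $want)"
    return 1
  fi
}

# Example calls (these need Python and PyTorch to be installed already):
#   check_min python "$(python -c 'import platform; print(platform.python_version())')" 3.8
#   check_min torch "$(python -c 'import torch; print(torch.__version__.split("+")[0])')" 1.7
#   check_min torchvision "$(python -c 'import torchvision; print(torchvision.__version__.split("+")[0])')" 0.8
```

The `split("+")` in the example calls drops local build suffixes such as `+cu118` so that only the release number is compared.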

Clone the repository

Get the source, which includes the Segment Anything and GroundingDINO subfolders.

bash

git clone https://github.com/IDEA-Research/Grounded-Segment-Anything
cd Grounded-Segment-Anything

Install the models and dependencies

Install Segment Anything and Grounding DINO as editable packages, then the extra Python dependencies.

bash

python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
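
For a local GPU build, the upstream README describes setting a few environment variables before running the install commands above. A minimal sketch follows; the CUDA path is an assumption and should point at whatever toolkit is actually installed on your machine.

```shell
# Settings described in the upstream README for a local GPU build.
# Export these before installing GroundingDINO.
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
# Assumption: the CUDA toolkit lives at /usr/local/cuda.
# An existing CUDA_HOME value is kept if one is already set.
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
```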

Download the pretrained weights

Fetch the SAM ViT-H checkpoint and the Grounding DINO Swin-T checkpoint into the repository root, which is where the demo command expects to find them.

bash

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
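
Both checkpoints are large downloads, so a quick check that they arrived is worthwhile before running the demo. The helper below is a hypothetical sketch (`check_weights` is not part of the project). It only confirms that each file exists and is not empty; it does not verify integrity, so a truncated download would still pass.

```shell
# Hypothetical helper: confirm each checkpoint file exists and is not empty.
# Returns non-zero if any file fails the check.
check_weights() {
  status=0
  for file in "$@"; do
    if [ -s "$file" ]; then
      echo "ok: $file"
    else
      echo "missing or empty: $file"
      status=1
    fi
  done
  return "$status"
}

# Usage, from the repository root:
#   check_weights sam_vit_h_4b8939.pth groundingdino_swint_ogc.pth
```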

Run the demo with a text prompt

Point the demo at an image and a prompt; it writes the detection and segmentation results to the output directory.

bash

python grounded_sam_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/demo1.jpg \
  --output_dir "outputs" \
  --text_prompt "bear"
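
The upstream README's version of this command also passes `--box_threshold 0.3`, `--text_threshold 0.25`, and `--device "cuda"`; confirm the current flags against the repo before relying on them. Building on that, the sketch below runs the demo over every `.jpg` in a folder. The function name and the `DRY_RUN`, `BOX_THRESHOLD`, `TEXT_THRESHOLD`, and `DEVICE` variables are hypothetical conveniences, not project features.

```shell
# Hypothetical helper: run grounded_sam_demo.py on every .jpg in a folder.
# Usage: run_grounded_sam_batch IMAGE_DIR "text prompt"
# Set DRY_RUN=1 to print each command instead of executing it.
run_grounded_sam_batch() {
  image_dir=$1
  prompt=$2
  for image in "$image_dir"/*.jpg; do
    [ -e "$image" ] || continue   # the glob matched nothing
    name=$(basename "$image" .jpg)
    # Build the command as positional parameters so arguments with spaces survive.
    set -- python grounded_sam_demo.py \
      --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
      --grounded_checkpoint groundingdino_swint_ogc.pth \
      --sam_checkpoint sam_vit_h_4b8939.pth \
      --input_image "$image" \
      --output_dir "outputs/$name" \
      --box_threshold "${BOX_THRESHOLD:-0.3}" \
      --text_threshold "${TEXT_THRESHOLD:-0.25}" \
      --text_prompt "$prompt" \
      --device "${DEVICE:-cuda}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$*"
    else
      "$@"
    fi
  done
}

# Example: preview the commands for a folder of images without running them.
#   DRY_RUN=1 run_grounded_sam_batch assets "bear"
```

Each image gets its own subdirectory under `outputs/` so that results from one run cannot overwrite another's. Set `DEVICE=cpu` on machines without a GPU, at the cost of much slower inference.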

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Auto-labeling image datasets by describing target classes in text instead of drawing masks by hand
Zero-shot segmentation of objects that were never in a fixed training label set
Building annotation pipelines, for example pairing RAM-Grounded-SAM to tag and segment images automatically
Prototyping open-world vision workflows that combine detection, segmentation, and downstream editing or inpainting

How Grounded-SAM compares

Grounded-SAM alongside other open-source vision and understanding tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
PaddleOCR	★ 83.1k	A toolkit for detecting and recognizing text in images across many languages, plus document parsing.
Ultralytics YOLO	★ 58.6k	A framework for training and running YOLO models for real-time object detection, segmentation, and tracking.
Supervision	★ 44.7k	A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models.
MMDetection	★ 32.8k	An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production.
Segment Anything 2 (SAM 2)	★ 19.4k	Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes.
Grounded-SAM	★ 17.6k	A pipeline that detects and segments anything in an image from a text prompt.
DINOv3	★ 10.7k	Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks.
Segment Anything 3 (SAM 3)	★ 10.6k	Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts.
