AI/TLDR

Segment Anything 3 (SAM 3)

Detect, segment, and track objects in images and video from text or visual prompts

Overview

Segment Anything 3 (SAM 3) is a foundation model from Meta for promptable segmentation in images and video. It can detect, segment, and track objects using either text phrases or visual prompts such as points, boxes, and masks, all from a single unified model.

Compared with SAM 2, SAM 3 adds the ability to find every instance of an open-vocabulary concept described by a short text phrase or by example images (exemplars). A presence token helps it tell apart closely related prompts, such as "a player in white" versus "a player in red", and a decoupled detector-tracker design keeps detection and tracking from interfering with each other.
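
One way to picture the presence token: the model makes an image-level judgement about whether the prompted concept appears at all, and that judgement gates the per-instance scores. The sketch below illustrates that idea only and is not SAM 3's code; the function name gate_scores, the use of a plain multiplication, and all of the numbers are assumptions chosen for the example.

```python
def gate_scores(instance_scores, presence_score):
    """Scale each per-instance score by an image-level presence score.

    Hypothetical helper: a low presence score suppresses every candidate,
    even ones that look plausible on their own.
    """
    return [score * presence_score for score in instance_scores]


# Two candidate players that both look like good matches in isolation.
instance_scores = [0.9, 0.8]

# Assumed presence scores for two closely related prompts on the same image.
player_in_white = gate_scores(instance_scores, presence_score=0.95)
player_in_red = gate_scores(instance_scores, presence_score=0.05)

print("a player in white:", player_in_white)
print("a player in red:", player_in_red)
```

With these made-up values, the same two candidates keep high scores for the prompt that matches the image and drop close to zero for the one that does not.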

It is aimed at computer-vision engineers and researchers who need instance masks across many object categories without training a custom detector for each one. The checkpoints are gated on Hugging Face, so you must request access before you can download the weights.

What it does

  • Open-vocabulary segmentation: find all instances of a concept from a short text phrase
  • Visual prompts as well: points, boxes, masks, and image exemplars
  • Works on both images and video, with object tracking across frames
  • Presence token improves discrimination between closely related text prompts
  • Decoupled detector-tracker architecture that scales with data
  • SAM 3.1 checkpoints add shared-memory joint multi-object tracking for faster inference

Getting started

SAM 3 needs a CUDA GPU and gated checkpoints from Hugging Face. Set up the environment, install the package, authenticate, then run a text prompt against an image.

Check prerequisites

You need Python 3.12+, PyTorch 2.7+, and a CUDA-compatible GPU with CUDA 12.6 or higher.
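
The requirements above can be checked from Python before installing anything else. This is a minimal sketch, not part of SAM 3: the helper names parse_version and check_prerequisites are made up for illustration, and it compares only major and minor version numbers. It reports None for PyTorch and CUDA when PyTorch is not installed yet.

```python
import importlib.util
import sys

# Minimum versions taken from the prerequisites listed above.
MIN_PYTHON = (3, 12)
MIN_TORCH = (2, 7)
MIN_CUDA = (12, 6)


def parse_version(text):
    """Turn a string such as '2.7.1+cu126' into (2, 7)."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split(".")[:2])


def check_prerequisites():
    """Map each requirement to True, False, or None (could not be checked)."""
    report = {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "torch": None,
        "cuda": None,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed yet

    import torch

    report["torch"] = parse_version(torch.__version__) >= MIN_TORCH
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = bool(
        torch.cuda.is_available()
        and cuda_version is not None
        and parse_version(cuda_version) >= MIN_CUDA
    )
    return report


if __name__ == "__main__":
    for name, status in check_prerequisites().items():
        print(f"{name}: {status}")
```

Run it once before creating the environment to check the Python version, and again after installing PyTorch to confirm the PyTorch and CUDA checks.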

Create the environment and install PyTorch

Set up a Conda environment and install PyTorch with CUDA support.

bash
conda create -n sam3 python=3.12
conda deactivate
conda activate sam3
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128

Clone and install SAM 3

Clone the repository and install the package in editable mode.

bash
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

Authenticate and run a text prompt

Request access to the checkpoints on the SAM 3 Hugging Face repo, log in by running hf auth login, then load an image and prompt it with text.

python
import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("<YOUR_IMAGE_PATH.jpg>")
inference_state = processor.set_image(image)

# Prompt the model with text
output = processor.set_text_prompt(state=inference_state, prompt="<YOUR_TEXT_PROMPT>")

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
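
Once a text prompt has run, a common next step is to keep only the confident detections. The sketch below assumes that output behaves like a dictionary with parallel "masks", "boxes", and "scores" entries; verify that against the official repo before relying on it. The helper name filter_by_score and the 0.5 threshold are illustrative and not part of SAM 3, and the stand-in data lets the example run without a GPU or checkpoint.

```python
def filter_by_score(output, threshold=0.5):
    """Keep only the instances whose score is at least `threshold`.

    `output` is assumed to map "masks", "boxes", and "scores" to
    equal-length sequences (lists, arrays, or tensors).
    """
    scores = output["scores"]
    keep = [i for i, score in enumerate(scores) if float(score) >= threshold]
    return {
        "masks": [output["masks"][i] for i in keep],
        "boxes": [output["boxes"][i] for i in keep],
        "scores": [float(scores[i]) for i in keep],
    }


# Stand-in data in place of a real model output (assumed structure).
fake_output = {
    "masks": ["mask_a", "mask_b", "mask_c"],
    "boxes": [[0, 0, 10, 10], [5, 5, 20, 20], [1, 2, 3, 4]],
    "scores": [0.92, 0.31, 0.55],
}

confident = filter_by_score(fake_output, threshold=0.5)
print(len(confident["scores"]), "instances kept")
```

With a real model you would pass the output of set_text_prompt instead of fake_output, provided its structure matches the assumption above.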

When to use it

  • Segment every instance of a category (e.g. "every car" or "a player in red") in an image using a text phrase
  • Track segmented objects across frames in a video
  • Build a labeling or annotation tool that masks objects from points, boxes, or example crops
  • Prototype open-vocabulary detection without training a per-class detector

How Segment Anything 3 (SAM 3) compares

The table below lists Segment Anything 3 (SAM 3) alongside the other open-source vision and understanding tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing.
Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking.
Supervision | ★ 44.7k | A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models.
MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production.
Segment Anything 2 (SAM 2) | ★ 19.4k | Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes.
Grounded-SAM | ★ 17.6k | A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts.
DINOv3 | ★ 10.7k | Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks.
Segment Anything 3 (SAM 3) | ★ 10.6k | Detect, segment, and track objects in images and video from text or visual prompts.