Overview
SAM 2 (Segment Anything Model 2) is a model from Meta's FAIR team for promptable visual segmentation. You give it a prompt — a click, a set of points, or a box — and it returns a mask for the object you pointed at. It extends the original Segment Anything model from single images to video by treating an image as a one-frame video.
The main step beyond SAM is video tracking. SAM 2 uses a transformer with streaming memory, so once you mark an object in one frame it can follow that object across the rest of the clip. The 2.1 checkpoints come in four sizes (tiny, small, base-plus, large) so you can trade accuracy for speed.
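To make the streaming-memory idea concrete, here is a toy sketch. It is not SAM 2's implementation: it only assumes that frames are processed one at a time and that the model keeps a bounded, first-in first-out store of recent frames plus a small store of prompted frames. The class name `ToyMemoryBank`, the capacities, and the string "features" are all hypothetical.

```python
from collections import deque


class ToyMemoryBank:
    """Toy stand-in for a streaming memory: bounded, first-in first-out."""

    def __init__(self, max_recent=6, max_prompted=2):
        # Hypothetical capacities, chosen for illustration only.
        self.recent = deque(maxlen=max_recent)
        self.prompted = deque(maxlen=max_prompted)

    def add(self, frame_idx, features, prompted=False):
        # Prompted frames and ordinary frames go to separate queues.
        target = self.prompted if prompted else self.recent
        target.append((frame_idx, features))

    def context(self):
        # Frame indices the next prediction would be conditioned on.
        return [idx for idx, _ in self.prompted] + [idx for idx, _ in self.recent]


bank = ToyMemoryBank(max_recent=3, max_prompted=1)
bank.add(0, "features-0", prompted=True)  # the frame the user clicked on
for i in range(1, 6):
    bank.add(i, f"features-{i}")          # later frames, processed in order
print(bank.context())
```

The point of the sketch is that the context stays the same size however long the clip is, while the prompted frame is kept even after older unprompted frames have been dropped.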
SAM 2 is a computer vision model aimed at developers and researchers who need object masks rather than just bounding boxes. Typical users build annotation tools, video editing features, or datasets. It ships with both an image predictor and a video predictor that you call from Python.
What it does
- Promptable segmentation from clicks, point sets, or boxes on static images
- Object tracking across video frames using a streaming memory architecture
- Multi-object tracking with per-object inference, including adding new objects after tracking has started
- Four SAM 2.1 checkpoint sizes (tiny, small, base-plus, large) to balance speed and accuracy
- torch.compile support for the full video model, enabled with vos_optimized=True, for faster video object segmentation (VOS) inference
- Separate SAM2ImagePredictor and video predictor APIs, plus released training and fine-tuning code
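Choosing a size comes down to picking a checkpoint file and its matching config. The sketch below maps size names to file pairs. The large entry matches the paths used later on this page; the other three file names are assumptions that follow the same naming pattern, so verify them against your checkout. The helper `pick_checkpoint` and its default directory are hypothetical.

```python
# Assumed file names for the SAM 2.1 release; confirm against the repo.
SAM21_VARIANTS = {
    "tiny": ("sam2.1_hiera_tiny.pt", "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small": ("sam2.1_hiera_small.pt", "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large": ("sam2.1_hiera_large.pt", "configs/sam2.1/sam2.1_hiera_l.yaml"),
}


def pick_checkpoint(size, checkpoint_dir="./checkpoints"):
    """Return (checkpoint_path, model_cfg) for a size name (hypothetical helper)."""
    try:
        filename, model_cfg = SAM21_VARIANTS[size]
    except KeyError:
        options = ", ".join(sorted(SAM21_VARIANTS))
        raise ValueError(f"unknown size {size!r}; expected one of: {options}") from None
    return f"{checkpoint_dir}/{filename}", model_cfg


checkpoint, model_cfg = pick_checkpoint("large")
print(checkpoint, model_cfg)
```

The returned pair can be passed to `build_sam2(model_cfg, checkpoint)` or `build_sam2_video_predictor(model_cfg, checkpoint)` in place of hard-coded paths.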
Getting started
SAM 2 needs Python 3.10+, PyTorch 2.5.1+, and torchvision 0.20.1+, and is best run on a GPU machine. Clone the repo, install it, download a checkpoint, then call the predictor.
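Before installing, it can help to check the environment against those minimums. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the version parsing is deliberately simple (it reads the leading digits of the first three components and ignores suffixes such as `+cu121`).

```python
import re
import sys
from importlib import metadata

# Minimum versions as stated above.
MINIMUMS = {"torch": "2.5.1", "torchvision": "0.20.1"}


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1), ignoring local build suffixes."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed, minimum):
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


for problem in check_environment():
    print("warning:", problem)
```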
Install from source
Clone the repository and install it in editable mode. On Windows, the README recommends using WSL with Ubuntu.
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
Download a checkpoint
Fetch the SAM 2.1 model checkpoints with the provided script (or download a single .pt file individually).
cd checkpoints && \
./download_ckpts.sh && \
cd ..
Segment an image
Load the image predictor, set your image, then pass your prompts to predict() to get masks.
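The prompts themselves are just coordinates. As a hedged sketch, the helper below assembles them from plain Python values. It assumes the keyword names `point_coords`, `point_labels`, and `box`, with points as (x, y) pixels, label 1 for foreground and 0 for background, and boxes as (x_min, y_min, x_max, y_max); check these against the `predict()` signature in your installed version. The helper `build_prompts` is hypothetical.

```python
def build_prompts(clicks=(), box=None):
    """Assemble keyword arguments for predict() from plain Python values.

    clicks: iterable of (x, y, is_foreground) in pixel coordinates.
    box: optional (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    prompts = {}
    clicks = list(clicks)
    if clicks:
        prompts["point_coords"] = [[float(x), float(y)] for x, y, _ in clicks]
        prompts["point_labels"] = [1 if fg else 0 for _, _, fg in clicks]
    if box is not None:
        x_min, y_min, x_max, y_max = box
        if x_max <= x_min or y_max <= y_min:
            raise ValueError("box must be (x_min, y_min, x_max, y_max) with positive area")
        prompts["box"] = [float(x_min), float(y_min), float(x_max), float(y_max)]
    if not prompts:
        raise ValueError("provide at least one click or a box")
    return prompts


# One foreground click on the object and one background click next to it.
prompts = build_prompts(clicks=[(500, 375, True), (520, 400, False)])
print(prompts)
```

The predictor works with NumPy arrays, so in practice each value would be wrapped with `np.asarray(...)` before being passed as `predictor.predict(**prompts)`.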
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
Track objects in a video
Use the video predictor: initialize state, add prompts on a frame, then propagate the masks across the whole video.
import torch
from sam2.build_sam import build_sam2_video_predictor
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)
    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)
    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
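The propagation loop yields one result per frame, so a common next step is to gather the results into a lookup keyed by frame and object. The sketch below does that for any iterable of `(frame_idx, object_ids, masks)` tuples. It assumes the masks arrive as logits where positive values mean foreground, which is how the repository's example notebook treats them; the helper `collect_masklets` and its `binarize` parameter are hypothetical.

```python
def collect_masklets(propagation, binarize=lambda logits: logits > 0.0):
    """Group per-frame outputs as {frame_idx: {object_id: binary_mask}}."""
    segments = {}
    for frame_idx, object_ids, masks in propagation:
        segments[frame_idx] = {
            obj_id: binarize(masks[i]) for i, obj_id in enumerate(object_ids)
        }
    return segments


# Stand-in for predictor.propagate_in_video(state): two frames, two objects,
# with a single float in place of each mask tensor.
fake_propagation = [(0, [1, 2], [0.8, -1.5]), (1, [1, 2], [0.3, 2.0])]
segments = collect_masklets(fake_propagation)
print(segments)
```

With the real predictor, passing `predictor.propagate_in_video(state)` as `propagation` should work the same way, since `logits > 0.0` on a tensor produces a boolean mask.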
When to use it
- Speed up dataset annotation by letting annotators click an object and get a clean mask instead of drawing by hand
- Track a selected object across the frames of a video for editing, effects, or analysis
- Build interactive segmentation into a tool where a user clicks or drags a box to select things
- Fine-tune the model on a custom domain using the released training code
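For the annotation use case, a mask often has to be converted into whatever the dataset format stores. As one illustrative example, the sketch below derives a bounding box from a binary mask given as nested lists. The helper `mask_to_bbox` is hypothetical and returns inclusive pixel indices; a real pipeline would more likely operate on NumPy arrays.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the truthy pixels, or None if empty."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
print(mask_to_bbox(mask))
```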
How Segment Anything 2 (SAM 2) compares
Segment Anything 2 (SAM 2) alongside other open-source vision & understanding tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing. |
| Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking. |
| Supervision | ★ 44.7k | A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models. |
| MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production. |
| Segment Anything 2 (SAM 2) | ★ 19.4k | Prompt, segment, and track any object across images and video frames. |
| Grounded-SAM | ★ 17.6k | A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts. |
| DINOv3 | ★ 10.7k | Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks. |
| Segment Anything 3 (SAM 3) | ★ 10.6k | Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts. |