DINOv3

Self-supervised vision backbones that produce general-purpose image features without fine-tuning

github.com/facebookresearch/dinov3 (★ 10.7k) | ai.meta.com/research/dinov3

Overview

DINOv3 is a family of vision foundation models from Meta AI Research (FAIR). The models are trained with self-supervised learning, so they learn from large amounts of unlabeled images rather than hand-labeled datasets. The result is a backbone that turns an image into general-purpose feature vectors you can reuse across many downstream tasks.

It is meant for computer vision developers and researchers who need strong image representations without training a model from scratch. The released backbones range from a 21M-parameter ViT-S/16 up to a 6.7B-parameter ViT-7B/16, plus ConvNeXt variants, so you can trade off accuracy against compute. The same frozen features can drive image classification, semantic segmentation, monocular depth estimation, retrieval, and dense matching.

In a vision pipeline, DINOv3 sits at the feature-extraction layer. You load a pretrained backbone, run images through it to get patch-level or global features, and attach a lightweight head (often a linear probe) for your specific task. The models are also available through the Hugging Face Hub, Transformers, and timm.
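
The "frozen backbone plus lightweight head" pattern can be sketched as follows. This is a minimal illustration, not the project's training code: random tensors stand in for features that a frozen backbone would produce, and the feature width (1024), the class count (10), and the name LinearProbe are assumptions chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

FEATURE_DIM = 1024   # assumed embedding width; depends on the backbone you load
NUM_CLASSES = 10     # hypothetical downstream task


class LinearProbe(nn.Module):
    """A single linear layer trained on top of frozen backbone features."""

    def __init__(self, feature_dim: int, num_classes: int) -> None:
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)


# Placeholder for backbone outputs: in practice these would be computed once
# by the frozen backbone under torch.no_grad(), one vector per image.
features = torch.randn(64, FEATURE_DIM)
labels = torch.randint(0, NUM_CLASSES, (64,))

probe = LinearProbe(FEATURE_DIM, NUM_CLASSES)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Only the probe's parameters are updated; the features stay fixed.
losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

logits = probe(features)  # shape: (64, NUM_CLASSES)
```

Because the backbone is never updated, features can be extracted once and cached, which keeps probe training cheap.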

What it does

Self-supervised training on the LVD-1689M web image dataset, so no labels are needed to learn the features
High-quality dense, patch-level features usable for segmentation, depth, and dense correspondence, not just whole-image classification
Multiple backbone sizes: ViT-S/S+/B/L/H+ distilled models, a 6.7B-parameter ViT-7B/16, and ConvNeXt variants
Loadable through torch.hub.load() from a local clone of the repo, with the weights argument set to a checkpoint URL or a local file path
Also supported by Hugging Face Transformers (>=4.56.0) and PyTorch Image Models / timm (>=1.0.20)
Reference code and configs for linear segmentation (ADE20K) and linear depth estimation (NYUv2-Depth)
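
Dense correspondence with patch-level features is commonly done by nearest-neighbor matching in feature space. The sketch below shows one generic approach, mutual nearest neighbors under cosine similarity; it is not taken from the DINOv3 repo. The function name, the feature width of 64, and the random arrays are placeholders for per-patch features from two images.

```python
import numpy as np


def mutual_nearest_neighbors(
    feats_a: np.ndarray, feats_b: np.ndarray
) -> list[tuple[int, int]]:
    """Return (i, j) pairs where patch i in A and patch j in B pick each other."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                       # (num_patches_a, num_patches_b)
    best_b_for_a = sim.argmax(axis=1)   # best match in B for each patch of A
    best_a_for_b = sim.argmax(axis=0)   # best match in A for each patch of B
    return [
        (i, int(j)) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i
    ]


# Synthetic stand-in: image B holds the same patches as image A, shuffled and
# slightly perturbed, so the correct correspondence is known in advance.
rng = np.random.default_rng(0)
patches_a = rng.normal(size=(256, 64))
perm = rng.permutation(256)
patches_b = patches_a[perm] + 0.01 * rng.normal(size=(256, 64))

matches = mutual_nearest_neighbors(patches_a, patches_b)
```

The mutual check discards one-sided matches, which tends to remove ambiguous pairs at the cost of fewer correspondences.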

Getting started

Set up the environment, request the pretrained weights from Meta, then load a backbone and extract features from an image. Note that the reference repo targets a Linux environment with PyTorch >= 2.7.1.

Create the environment

Clone the repo and create the provided conda/micromamba environment, which installs PyTorch and the other dependencies.

bash

git clone https://github.com/facebookresearch/dinov3.git
cd dinov3
micromamba env create -f conda.yaml
micromamba activate dinov3

Get the model weights

Request access on the DINOv3 downloads page. Once accepted, you receive URLs for the backbone and adapter weights, which you pass to torch.hub.load() as a URL or local file. Use wget rather than a browser to download.

Load a backbone and extract features

Load a backbone from your local clone, then run a normalized image through it to get features.

python

import torch
from PIL import Image
from torchvision.transforms import v2

REPO_DIR = "<PATH/TO/DINOV3/REPO>"

# Load a DINOv3 ViT-L/16 backbone from the local clone
dinov3_vitl16 = torch.hub.load(
    REPO_DIR,
    "dinov3_vitl16",
    source="local",
    weights="<CHECKPOINT/URL/OR/PATH>",
)
dinov3_vitl16.eval()  # inference only

# Preprocess: resize, scale pixel values to [0, 1], normalize with ImageNet mean/std
transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((256, 256), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=(0.485, 0.456, 0.406),
                 std=(0.229, 0.224, 0.225)),
])

# Force 3 channels so grayscale or RGBA files match the 3-channel normalization
img = Image.open("image.jpg").convert("RGB")

# Add a batch dimension and run the backbone without tracking gradients
with torch.inference_mode():
    features = dinov3_vitl16(transform(img)[None])

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
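
As a quick sanity check on the preprocessing in this section: the "/16" in a name such as ViT-L/16 is the patch size in pixels, so a 256 x 256 input is split into a 16 x 16 grid of patches. The helper below is a hypothetical name, not part of the repo, and it assumes the input height and width are multiples of the patch size.

```python
def patch_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Number of patches along each axis for a ViT with square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(256, 256)
num_patches = rows * cols
print(rows, cols, num_patches)  # 16 16 256
```

Larger inputs produce more patch tokens, which gives finer dense feature maps at a higher compute cost.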

When to use it

Extract reusable image embeddings as input to a custom classifier or downstream model without training a vision backbone yourself
Build semantic segmentation or monocular depth estimation by adding a linear probe on top of frozen DINOv3 features
Power image retrieval and similarity search using the global feature vectors
Use dense patch features for correspondence and matching, including specialized domains like satellite or microscopy imagery via the metadata-guided recipes
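
Retrieval with global features usually reduces to a nearest-neighbor search over embedding vectors. The following sketch ranks a gallery by cosine similarity to a query. The random vectors stand in for global features from a backbone, and the name top_k_similar and the 384-dimensional width are illustrative assumptions.

```python
import numpy as np


def top_k_similar(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k gallery vectors most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                  # cosine similarity per gallery item
    return np.argsort(-scores)[:k]  # highest similarity first


# Synthetic stand-in: the query is a slightly perturbed copy of gallery item 42.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 384))
query = gallery[42] + 0.05 * rng.normal(size=384)

top = top_k_similar(query, gallery, k=5)
```

This brute-force search is fine for small galleries; larger collections typically call for an approximate nearest-neighbor index.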

How DINOv3 compares

DINOv3 alongside other open-source vision & understanding tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
PaddleOCR	★ 83.1k	A toolkit for detecting and recognizing text in images across many languages, plus document parsing.
Ultralytics YOLO	★ 58.6k	A framework for training and running YOLO models for real-time object detection, segmentation, and tracking.
Supervision	★ 44.7k	A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models.
MMDetection	★ 32.8k	An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production.
Segment Anything 2 (SAM 2)	★ 19.4k	Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes.
Grounded-SAM	★ 17.6k	A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts.
DINOv3	★ 10.7k	Self-supervised vision backbones that produce general-purpose image features without fine-tuning.
Segment Anything 3 (SAM 3)	★ 10.6k	Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts.
