Overview
generative-models is the official open-source codebase from Stability AI for its Stable Diffusion family. It collects the model code, sampling scripts, and configs for SDXL (text-to-image) and the Stable Video line for video and multi-view generation, including Stable Video Diffusion (SVD), SV3D, SV4D, and SV4D 2.0.
It is aimed at researchers and engineers who want to run or build on these models directly rather than through a hosted API. You download model weights from Hugging Face, drop them into a checkpoints/ folder, and call the sampling scripts in scripts/sampling/. The repo uses a config-driven design that separates samplers and guiders from the core diffusion models.
Within the image-generation category, it sits at the source: it is the reference implementation many other tools and pipelines wrap. If you need the actual Stability AI code and want control over the inference loop, this is where the models live.
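The config-driven design mentioned above can be pictured with a small sketch. This is not the repository's code: the class names, the `target` and `params` keys, and the registry are assumptions chosen for illustration. The only point is that the sampler and guider are picked by configuration rather than hard-wired into the model.

```python
class EulerSampler:
    """Hypothetical stand-in for a sampler; not the repo's implementation."""

    def __init__(self, num_steps=50):
        self.num_steps = num_steps


class VanillaGuider:
    """Hypothetical stand-in for a guidance module."""

    def __init__(self, scale=5.0):
        self.scale = scale


# Illustrative registry mapping config names to classes (an assumption).
REGISTRY = {"EulerSampler": EulerSampler, "VanillaGuider": VanillaGuider}


def instantiate_from_config(config):
    """Build the class named by config["target"] with config["params"] as kwargs."""
    cls = REGISTRY[config["target"]]
    return cls(**config.get("params", {}))


# Swapping the sampler or guider means editing these dicts, not the model code.
sampler_config = {"target": "EulerSampler", "params": {"num_steps": 30}}
guider_config = {"target": "VanillaGuider", "params": {"scale": 7.5}}

sampler = instantiate_from_config(sampler_config)
guider = instantiate_from_config(guider_config)
print(type(sampler).__name__, sampler.num_steps)
print(type(guider).__name__, guider.scale)
```

In a real project the dictionaries would typically be loaded from config files; check the repo's own configs for the actual keys and class paths.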
What it does
- SDXL text-to-image models (base, refiner, and SDXL-Turbo) with ready-to-run sampling scripts
- Stable Video Diffusion (SVD / SVD-XT) for image-to-video synthesis
- SV3D for novel-view and multi-view generation from a single image
- SV4D and SV4D 2.0 video-to-4D models for novel-view video synthesis of moving objects
- Config-driven architecture that separates samplers and guiders from the diffusion model code
- Inference scripts with options for low-VRAM runs (--encoding_t, --decoding_t, smaller --img_size) and optional background removal via rembg
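As a rough illustration of the low-VRAM options in the last item, the sketch below assembles a command line for the SV4D 2.0 sampling script used in the Getting started section. The helper name `build_sampling_command` is hypothetical, and the values `1`, `1`, and `512` are assumed starting points rather than documented requirements; confirm the flags against the script's own help output before relying on them.

```python
import shlex
import sys


def build_sampling_command(input_path, output_folder="outputs", low_vram=False):
    """Return an argv list for the SV4D 2.0 sampling script (hypothetical helper)."""
    cmd = [
        sys.executable,
        "scripts/sampling/simple_video_sample_4d2.py",
        "--input_path", input_path,
        "--output_folder", output_folder,
    ]
    if low_vram:
        # Assumed values: process one frame at a time and lower the resolution.
        cmd += ["--encoding_t=1", "--decoding_t=1", "--img_size=512"]
    return cmd


cmd = build_sampling_command("assets/sv4d_videos/camel.gif", low_vram=True)
# Print a copy-pasteable version of the command without running it.
print(shlex.join(cmd))
```

To launch the run from the repository root, the list could be passed to `subprocess.run(cmd, check=True)`.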
Getting started
Set up a Python 3.10 environment, install the dependencies, then download weights from Hugging Face and run a sampling script. The example below shows the SV4D 2.0 video-to-4D flow from the README.
Create the environment and install dependencies
Make a virtual environment, install a CUDA-matched PyTorch build, then install the project requirements and package.
python3.10 -m venv .generativemodels
source .generativemodels/bin/activate
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # pick the index URL that matches your CUDA version
pip3 install -r requirements/pt2.txt
pip3 install .
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
Download model weights
Pull the checkpoint you want from Hugging Face into the checkpoints/ folder. This example fetches the SV4D 2.0 weights.
huggingface-cli download stabilityai/sv4d2.0 sv4d2.safetensors --local-dir checkpoints
Run a sampling script
Call the matching script in scripts/sampling/ with your input. This runs SV4D 2.0 on an example video and writes results to outputs/.
python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
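If you prefer to script the download step in Python, the sketch below does the same job as the huggingface-cli command above, but skips the download when the file is already present. The helper name `ensure_checkpoint` is hypothetical; `hf_hub_download` is the `huggingface_hub` library function, imported lazily so the existence check works even where that package is not installed.

```python
from pathlib import Path


def ensure_checkpoint(repo_id, filename, local_dir="checkpoints"):
    """Return the local checkpoint path, downloading only if it is missing."""
    path = Path(local_dir) / filename
    if path.is_file():
        # Already downloaded: nothing to do.
        return path
    # Lazy import: only needed when a download is actually required.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=repo_id, filename=filename, local_dir=local_dir
    )
    return Path(downloaded)
```

For the SV4D 2.0 example, the call would be `ensure_checkpoint("stabilityai/sv4d2.0", "sv4d2.safetensors")`, run from the repository root before the sampling script.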
When to use it
- Generate images locally with SDXL or SDXL-Turbo using the provided text-to-image sampling scripts
- Turn a single still image into a short video clip with Stable Video Diffusion
- Produce novel-view or multi-view renders of an object with SV3D for 3D research
- Run video-to-4D synthesis of a moving object with SV4D / SV4D 2.0 for novel-view video research
How Stability-AI generative-models compares
The table below lists Stability-AI generative-models alongside the other open-source image generation tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Stable Diffusion web UI (AUTOMATIC1111) | ★ 164k | A browser interface for running Stable Diffusion image generation locally with extensions and fine-grained controls. |
| ComfyUI | ★ 118k | A node-based visual editor for building and running image and video generation pipelines like Stable Diffusion and FLUX locally. |
| Fooocus | ★ 50.4k | A simplified image generation app built on Stable Diffusion that hides technical settings for easy prompting. |
| InvokeAI | ★ 27.5k | A self-hosted creative tool and canvas for generating and editing images with open diffusion models. |
| Stability-AI generative-models | ★ 27.2k | Stability AI's official code for the Stable Diffusion and Stable Video model family. |
| FLUX | ★ 25.6k | Black Forest Labs' open-weight diffusion models and inference code for generating and editing images from text prompts. |
| Z-Image | ★ 11.6k | Alibaba Tongyi's 6B-parameter open image model that produces photorealistic images quickly on a single GPU. |
| DALLE2-pytorch | ★ 11.3k | An open implementation of DALL-E 2 in PyTorch, with the CLIP encoder, diffusion prior, and cascading decoder you train to generate images from text. |