Overview
Z-Image is an open-source text-to-image model family from Alibaba's Tongyi lab, built on a single-stream diffusion transformer (S3-DiT). At 6 billion parameters, it is small enough to run on a single GPU while still producing photorealistic images and rendering both English and Chinese text in the picture.
The family has several variants. Z-Image-Turbo is a distilled version that generates an image in only 8 sampling steps (8 NFEs) and fits within 16GB of VRAM, so it works on many consumer cards. The base Z-Image model focuses on higher-quality and more diverse output and is meant for fine-tuning, while Z-Image-Edit and the Omni-Base checkpoint target image editing.
Z-Image is one of several open diffusion models for image generation. It is aimed at developers who want a self-hostable model that they can run locally, integrate into a pipeline, or fine-tune for their own use, rather than calling a hosted API.
What it does
- 6B-parameter single-stream diffusion transformer (S3-DiT) that runs on one GPU
- Z-Image-Turbo generates images in 8 steps (8 NFEs) and fits in 16GB VRAM
- Photorealistic image generation with strong aesthetic quality
- Bilingual text rendering for English and Chinese inside images
- Multiple variants: a base model for fine-tuning, plus Turbo, Edit, and Omni-Base checkpoints
- Works through the diffusers library via a ZImagePipeline
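As a rough back-of-the-envelope check on the single-GPU claim, the sketch below estimates the memory needed just to hold 6 billion parameters at two common precisions. It assumes the standard sizes of 2 bytes per parameter for bfloat16 and 4 bytes for float32, and it counts only the transformer weights. The text encoder, VAE, and activations add to this, so the result is a lower bound rather than a measured footprint. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Bytes per parameter for two common dtypes (standard sizes, assumed here).
DTYPE_BYTES = {"float32": 4, "bfloat16": 2}

if __name__ == "__main__":
    for dtype, nbytes in DTYPE_BYTES.items():
        # 6e9 is the parameter count quoted for Z-Image.
        print(f"{dtype}: {weight_memory_gb(6e9, nbytes):.1f} GB")
```

Under these assumptions the bfloat16 weights alone come to about 12 GB, which is consistent with the 16GB figure quoted for Turbo, while float32 weights would not fit on such a card.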
Getting started
Z-Image runs through Hugging Face diffusers. Install diffusers from source (Z-Image support landed recently), then load the Turbo checkpoint and generate an image.
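Because support is recent, an already-installed diffusers release may or may not include it. The sketch below is one way to check whether the current environment exposes ZImagePipeline before deciding to install from source. The helper name has_zimage_support is hypothetical, and the check only confirms that the class can be looked up, not that a GPU or the model weights are available.

```python
import importlib


def has_zimage_support() -> bool:
    """Return True if the installed diffusers package exposes ZImagePipeline."""
    try:
        diffusers = importlib.import_module("diffusers")
        # Attribute lookup fails on builds that predate Z-Image support.
        getattr(diffusers, "ZImagePipeline")
    except Exception:
        # Covers a missing package, a missing class, and import-time failures.
        return False
    return True


if __name__ == "__main__":
    if has_zimage_support():
        print("ZImagePipeline is available.")
    else:
        print("ZImagePipeline not found; install diffusers from source.")
```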
Install diffusers from source
Z-Image support is in the latest diffusers, so install it directly from the GitHub main branch.
pip install git+https://github.com/huggingface/diffusers

Generate an image with Z-Image-Turbo
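Load the ZImagePipeline, move it to CUDA, and run a prompt. Because Turbo is distilled, it needs only a few sampling steps and runs with guidance_scale=0.0. The example passes num_inference_steps=9; according to the project's documentation, this results in 8 transformer forward passes, which matches the 8 NFEs quoted for Turbo.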
import torch
from diffusers import ZImagePipeline

# Load the distilled Turbo checkpoint in bfloat16 to keep memory use down.
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# Generate one 1024x1024 image; the fixed seed makes the result repeatable.
image = pipe(
    prompt="Young Chinese woman in red Hanfu, intricate embroidery",
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("example.png")

Or install the repo for native inference
To run the repo's own PyTorch inference code, clone Z-Image and install it in editable mode inside your virtual environment.
git clone https://github.com/Tongyi-MAI/Z-Image.git
cd Z-Image
pip install -e .

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Generate photorealistic images locally on a single 16GB consumer GPU instead of calling a paid hosted API
- Create graphics that need correct English or Chinese text rendered inside the image
- Fine-tune the base Z-Image model on your own dataset for a custom style or domain
- Add text-to-image generation to an app or pipeline through the diffusers ZImagePipeline
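For the last use case, an application would typically wrap the pipeline call in a small function so the sampling settings live in one place. The sketch below is a minimal version of that idea. TurboSettings and generate_image are hypothetical names, the default values are copied from the Turbo example in Getting started, and pipe is assumed to be any callable that accepts these keyword arguments and returns an object with an images list, as a loaded ZImagePipeline does in that example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TurboSettings:
    """Sampling settings copied from the Turbo example in this guide."""
    height: int = 1024
    width: int = 1024
    num_inference_steps: int = 9
    guidance_scale: float = 0.0


def generate_image(
    pipe: Any,
    prompt: str,
    settings: TurboSettings = TurboSettings(),
    generator: Optional[Any] = None,
) -> Any:
    """Run an already-loaded pipeline on one prompt and return the first image."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    result = pipe(
        prompt=prompt,
        height=settings.height,
        width=settings.width,
        num_inference_steps=settings.num_inference_steps,
        guidance_scale=settings.guidance_scale,
        generator=generator,  # pass a seeded torch.Generator for repeatable output
    )
    return result.images[0]
```

With a pipeline loaded as in Getting started, a call could look like generate_image(pipe, "a red lantern at night", generator=torch.Generator("cuda").manual_seed(42)). Passing the pipeline in as an argument keeps the wrapper free of model-loading code, which also makes it easy to exercise with a stand-in object in tests.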
How Z-Image compares
The table below lists Z-Image alongside other open-source image generation tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Stable Diffusion web UI (AUTOMATIC1111) | ★ 164k | A browser interface for running Stable Diffusion image generation locally with extensions and fine-grained controls. |
| ComfyUI | ★ 118k | A node-based visual editor for building and running image and video generation pipelines like Stable Diffusion and FLUX locally. |
| Fooocus | ★ 50.4k | A simplified image generation app built on Stable Diffusion that hides technical settings for easy prompting. |
| InvokeAI | ★ 27.5k | A self-hosted creative tool and canvas for generating and editing images with open diffusion models. |
| Stability-AI generative-models | ★ 27.2k | Stability AI's official code for its Stable Diffusion family of image and video generation models. |
| FLUX | ★ 25.6k | Black Forest Labs' open-weight diffusion models and inference code for generating and editing images from text prompts. |
| Z-Image | ★ 11.6k | Alibaba Tongyi's 6B open image model that renders photorealistic pictures on one consumer GPU. |
| DALLE2-pytorch | ★ 11.3k | An open implementation of DALL-E 2 in PyTorch, with the CLIP encoder, diffusion prior, and cascading decoder you train to generate images from text. |