Z-Image

Alibaba Tongyi's 6B open image model that renders photoreal pictures on one consumer GPU

github.com/Tongyi-MAI/Z-Image (★ 11.6k)
huggingface.co/Tongyi-MAI/Z-Image-Turbo

Overview

Z-Image is an open-source text-to-image model family from Alibaba's Tongyi lab, built on a single-stream diffusion transformer (S3-DiT). At 6 billion parameters, it is small enough to run on a single GPU while still producing photorealistic images and rendering both English and Chinese text in the picture.
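To make "single-stream" more concrete, here is a toy sketch of the general idea. It assumes that text tokens and image tokens are concatenated into one sequence and processed by the same transformer weights, instead of by separate per-modality streams. This is not Z-Image's S3-DiT. The function names are hypothetical, and the block uses identity query/key/value projections with no MLP, normalization, or timestep conditioning.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    Every token attends to every token in the one combined sequence.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) * scale for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def single_stream_block(text_tokens, image_tokens):
    """Hypothetical single-stream block: one shared pass over text + image tokens."""
    seq = text_tokens + image_tokens  # concatenate modalities into ONE sequence
    mixed = joint_self_attention(seq)
    # Residual connection, then split back so callers can read the image part.
    seq = [[a + b for a, b in zip(x, y)] for x, y in zip(seq, mixed)]
    return seq[: len(text_tokens)], seq[len(text_tokens):]


text = [[1.0, 0.0], [0.0, 1.0]]               # two "prompt" tokens
image = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]  # three "latent patch" tokens
new_text, new_image = single_stream_block(text, image)
print(len(new_text), len(new_image))  # 2 3
```

Because image tokens attend over the same sequence as the text tokens, changing the prompt tokens changes every image token's output. That shared sequence is the core idea behind a single-stream design.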

The family has several variants. Z-Image-Turbo is a distilled version that generates an image in only 8 sampling steps (8 NFEs) and fits within 16GB of VRAM, so it works on many consumer cards. The base Z-Image model focuses on higher-quality and more diverse output and is meant for fine-tuning, while Z-Image-Edit and the Omni-Base checkpoint target image editing.
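A quick weight-memory calculation helps explain the 16GB figure. The sketch below assumes bfloat16 weights (2 bytes per parameter, matching torch_dtype=torch.bfloat16 in the diffusers example further down) and counts only the 6B-parameter transformer. The text encoder, VAE, and activations also need memory, so the result is a lower bound, not the project's own accounting.

```python
# Back-of-envelope weight memory for a 6B-parameter model.
# Assumption: only the diffusion transformer's weights are counted; the text
# encoder, VAE, activations, and framework overhead are NOT included.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the memory needed to hold num_params weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


dit_params = 6e9
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(dit_params, dtype):.1f} GiB")
# float32: 22.4 GiB
# bfloat16: 11.2 GiB
```

In float32, the transformer weights alone would exceed 16GB. This is why the example loads the model in bfloat16.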

Z-Image joins other open diffusion models in the image-generation space. It is aimed at developers who want a self-hostable model they can run locally, integrate into a pipeline, or fine-tune, rather than calling a hosted API.

What it does

6B-parameter single-stream diffusion transformer (S3-DiT) that runs on one GPU
Z-Image-Turbo generates images in 8 steps (8 NFEs) and fits in 16GB VRAM
Photorealistic image generation with strong aesthetic quality
Bilingual text rendering for English and Chinese inside images
Multiple variants: a base model for fine-tuning, plus Turbo, Edit, and Omni-Base checkpoints
Works through the diffusers library via a ZImagePipeline

Getting started

Z-Image runs through Hugging Face diffusers. Install diffusers from source (Z-Image support landed recently), then load the Turbo checkpoint and generate an image.

Install diffusers from source

Z-Image support is in the latest diffusers, so install it directly from the GitHub main branch.

bash

pip install git+https://github.com/huggingface/diffusers
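ZImagePipeline only exists in recent diffusers builds, so a quick import check can confirm that the source install worked before you download any weights. has_zimage_pipeline is a hypothetical helper written for this page, not part of diffusers or Z-Image. It only tests whether the class is exposed.

```python
import importlib
import importlib.util


def has_zimage_pipeline() -> bool:
    """Return True if the installed diffusers build exposes ZImagePipeline."""
    if importlib.util.find_spec("diffusers") is None:
        return False  # diffusers is not installed at all
    try:
        diffusers = importlib.import_module("diffusers")
        # Older releases lack the attribute, so hasattr() returns False.
        return hasattr(diffusers, "ZImagePipeline")
    except Exception:
        # Import failed (e.g. a missing dependency); treat as unavailable.
        return False


if __name__ == "__main__":
    if has_zimage_pipeline():
        print("ZImagePipeline is available")
    else:
        print("ZImagePipeline not found; reinstall diffusers from GitHub main")
```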

Generate an image with Z-Image-Turbo

Load the ZImagePipeline, move it to CUDA, and run a prompt. Turbo needs only a few sampling steps and runs with guidance_scale=0.0.

python

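import torch
from diffusers import ZImagePipeline

# Load the distilled Turbo checkpoint in bfloat16 to keep VRAM usage down.
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

image = pipe(
    prompt="Young Chinese woman in red Hanfu, intricate embroidery",
    height=1024,
    width=1024,
    num_inference_steps=9,  # per the upstream example, this yields 8 DiT forward passes (8 NFEs)
    guidance_scale=0.0,  # guidance should be 0.0 for the Turbo model
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducible output
).images[0]

image.save("example.png")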
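On cards with less than 16GB of VRAM, one option to try instead of pipe.to("cuda") is diffusers' general-purpose enable_model_cpu_offload(), which needs the accelerate package. The sketch below is hypothetical and untested against Z-Image specifically. choose_placement and place_pipeline are illustrative names, and the 15 GiB default threshold is an assumption. It sits slightly under 16 because a GPU may report a little less memory than its marketed capacity.

```python
GIB = 1024**3


def choose_placement(total_vram_bytes: int, threshold_gib: float = 15.0) -> str:
    """Return "cuda" if the card's reported VRAM meets the threshold, else "offload"."""
    # Assumption: 15 GiB rather than 16, because reported capacity can sit a bit
    # below a card's marketed size.
    return "cuda" if total_vram_bytes >= threshold_gib * GIB else "offload"


def place_pipeline(pipe) -> str:
    """Put a loaded diffusers pipeline fully on the GPU, or enable model CPU offload."""
    import torch  # imported here so choose_placement() can be used without torch

    _free, total = torch.cuda.mem_get_info()  # (free, total) bytes on the current device
    placement = choose_placement(total)
    if placement == "cuda":
        pipe.to("cuda")
    else:
        # Keeps each model on the CPU and moves it to the GPU only while it runs;
        # needs the accelerate package. Do not also call pipe.to("cuda").
        pipe.enable_model_cpu_offload()
    return placement
```

To use it, call place_pipeline(pipe) in place of the pipe.to("cuda") line. With offload enabled, expect slower generation, because models move between CPU and GPU memory on every call.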

Or install the repo for native inference

To run the repo's own PyTorch inference code, clone Z-Image and install it in editable mode inside your virtual environment.

bash

git clone https://github.com/Tongyi-MAI/Z-Image.git
cd Z-Image
pip install -e .

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Generate photorealistic images locally on a single 16GB consumer GPU instead of calling a paid hosted API
Create graphics that need correct English or Chinese text rendered inside the image
Fine-tune the base Z-Image model on your own dataset for a custom style or domain
Add text-to-image generation to an app or pipeline through the diffusers ZImagePipeline

How Z-Image compares

Z-Image alongside other open-source image-generation tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Stable Diffusion web UI (AUTOMATIC1111)	★ 164k	A browser interface for running Stable Diffusion image generation locally with extensions and fine-grained controls.
ComfyUI	★ 118k	A node-based visual editor for building and running image and video generation pipelines like Stable Diffusion and FLUX locally.
Fooocus	★ 50.4k	A simplified image generation app built on Stable Diffusion that hides technical settings for easy prompting.
InvokeAI	★ 27.5k	A self-hosted creative tool and canvas for generating and editing images with open diffusion models.
Stability-AI generative-models	★ 27.2k	Stability AI's official code for its Stable Diffusion family of image and video generation models.
FLUX	★ 25.6k	Black Forest Labs' open-weight diffusion models and inference code for generating and editing images from text prompts.
Z-Image	★ 11.6k	Alibaba Tongyi's 6B open image model that renders photoreal pictures on one consumer GPU
DALLE2-pytorch	★ 11.3k	An open implementation of DALL-E 2 in PyTorch, with the CLIP encoder, diffusion prior, and cascading decoder you train to generate images from text.
