Inclusion AI · 2026-04-22 · notable
LLaDA2.0-Uni — Unified Discrete Diffusion LLM for Multimodal Understanding and Generation
#1 Hugging Face paper today (129 upvotes). A 16B MoE discrete diffusion model that handles both multimodal understanding and image generation with a single backbone trained on masked token prediction, rather than chaining separate specialized models. Apache 2.0.

One 16B discrete diffusion backbone that both understands and generates images, in place of separate understanding and generation models.
Key specs
| Spec | Value |
|---|---|
| Parameters | 16B total, ~1B active per token |
| GPU memory (generation) | ~47 GB |
| GPU memory (understanding) | ~35 GB |
| Image generation steps | 8 (distilled decoder) |
| Hugging Face upvotes | 129 |
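
As a rough sanity check on the parameter and memory figures, the sketch below estimates weight memory and the fraction of parameters active per token. It assumes 16-bit (2-byte) weights, which the source does not state, and the helper names are illustrative only.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 is an ASSUMPTION (16-bit weights); the source
    does not say which precision the quoted memory figures use.
    """
    return n_params * bytes_per_param / 1e9


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the MoE parameters used for a single token."""
    return active_params / total_params


TOTAL_PARAMS = 16e9   # "16B total" from the spec table
ACTIVE_PARAMS = 1e9   # "~1B active per token" from the spec table

print(weight_memory_gb(TOTAL_PARAMS))                # 32.0
print(active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS))  # 0.0625
```

Under that assumption the weights alone take about 32 GB, which is in line with the ~35 GB quoted for understanding, and only about 6% of the parameters are used for any one token.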
What is it?
LLaDA2.0-Uni is a 16B Mixture-of-Experts discrete diffusion language model from Inclusion AI's AGI Research Center that unifies multimodal understanding (visual QA, captioning, document reading) and image generation (text-to-image, image editing) in a single architecture. It was released on April 22, 2026 under the Apache 2.0 license, with weights on Hugging Face and code on GitHub.
How does it work?
Visual inputs are tokenized into discrete semantic tokens via SigLIP-VQ, converting continuous pixel information into a vocabulary compatible with the discrete diffusion backbone. The MoE backbone applies block-level masked diffusion to both text and vision tokens uniformly. A lightweight diffusion decoder (trained with few-step distillation) converts output discrete tokens back to pixels in ~8 steps. The SPRINT acceleration system reuses KV caches and adaptively unmasks tokens to reduce inference cost.
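
To make the masked-diffusion idea concrete, here is a toy sketch of iterative unmasking over one block: every position starts masked, and at each step the most confident predictions are committed. This is not the LLaDA2.0-Uni implementation or the SPRINT schedule. The `MASK` id, `toy_predict` (a random stand-in for the model), and the even per-step commit schedule are all assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical mask id for this toy example


def toy_predict(tokens, vocab_size, rng):
    """Random stand-in for the model (ASSUMPTION, not a real network).

    Returns {position: (predicted_token, confidence)} for masked positions.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def unmask_block(block_len, vocab_size, steps, seed=0):
    """Fill a fully masked block in at most `steps` rounds,
    committing the highest-confidence predictions first."""
    rng = random.Random(seed)
    tokens = [MASK] * block_len
    for step in range(steps):
        preds = toy_predict(tokens, vocab_size, rng)
        if not preds:
            break  # nothing left to unmask
        remaining_steps = steps - step
        # Ceil division: commit enough tokens to finish on the last step.
        n_commit = -(-len(preds) // remaining_steps)
        most_confident = sorted(
            preds, key=lambda i: preds[i][1], reverse=True
        )[:n_commit]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


block = unmask_block(block_len=16, vocab_size=100, steps=4)
print(block)
```

An adaptive variant would replace the fixed `n_commit` with a confidence threshold, so that easy blocks finish in fewer steps; the description of SPRINT suggests something in that spirit, but the exact rule is not given here.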
Why does it matter?
Most multimodal systems chain separate specialized models: a VLM for understanding and a diffusion model for generation. LLaDA2.0-Uni collapses this into a single model trained with one objective, which simplifies pipelines and enables interleaved reasoning-then-generation tasks without model switching. The paper reports that the model matches specialized VLMs on understanding benchmarks while retaining strong image generation quality.
Who is it for?
ML researchers studying unified multimodal architectures; practitioners building image generation or editing pipelines.
Try it
pip install transformers torch
python -c "from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained('inclusionAI/LLaDA2.0-Uni', trust_remote_code=True)"