ByteDance Seed · 2026-04-15 · notable

Seedance 2.0 — ByteDance's Unified Audio-Video Generation Model

Rating: 3
Author: AI/TLDR

ByteDance has published Seedance 2.0, a model that jointly generates audio and video in one pass from text, image, audio, and video inputs. It produces 4–15 second clips at 480p or 720p and is globally available via CapCut, Dreamina, and Runway.


ByteDance's video model natively generates synchronized audio and video in a single generation pass, with no separate audio pipeline.

Key specs

Max output duration	15 seconds
Output resolutions	480p / 720p
Max image references	9
Max video references	3
Max audio references	3

What is it?

Seedance 2.0 is ByteDance's current flagship video generation model, described in an arXiv paper published on April 15, 2026, and globally available via CapCut, Dreamina, and Runway. It accepts text, images, audio clips, and video clips as inputs and generates 4–15 second audio-video clips at 480p or 720p. A Fast variant targets low-latency scenarios. The paper reports substantial improvements over Seedance 1.5 Pro across all evaluated dimensions.

How does it work?

Seedance 2.0 uses a Dual-branch Diffusion Transformer (DiT) that encodes all input modalities — text prompts, reference images, audio clips, and video clips — into a shared latent space. Audio and video are denoised jointly rather than sequentially, producing output where sound and motion are intrinsically synchronized. The multimodal reference system allows up to 9 images, 3 video clips, and 3 audio clips simultaneously to guide composition, camera movement, motion style, and audio character.
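For developers wiring the model into a pipeline, the published limits (4–15 second clips, 480p or 720p, up to 9 image, 3 video, and 3 audio references) can be checked before a job is submitted. The sketch below is a minimal illustration of that check. The `GenerationRequest` class, its field names, and the `validate` helper are hypothetical and do not reflect the request schema of any real Seedance, CapCut, Dreamina, Runway, or fal.ai API; only the numeric limits come from this article.

```python
from dataclasses import dataclass, field
from typing import List

# Limits taken from the specs quoted in this article.
MIN_DURATION_S = 4
MAX_DURATION_S = 15
ALLOWED_RESOLUTIONS = ("480p", "720p")
MAX_REFERENCES = {"image": 9, "video": 3, "audio": 3}


@dataclass
class GenerationRequest:
    """Hypothetical request shape, for illustration only (not a real API schema)."""
    prompt: str
    duration_s: int = 5
    resolution: str = "720p"
    image_refs: List[str] = field(default_factory=list)
    video_refs: List[str] = field(default_factory=list)
    audio_refs: List[str] = field(default_factory=list)


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the published limits."""
    problems = []
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration {req.duration_s}s is outside {MIN_DURATION_S}-{MAX_DURATION_S}s"
        )
    if req.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {req.resolution!r} is not one of {ALLOWED_RESOLUTIONS}")
    counts = {
        "image": len(req.image_refs),
        "video": len(req.video_refs),
        "audio": len(req.audio_refs),
    }
    for kind, count in counts.items():
        if count > MAX_REFERENCES[kind]:
            problems.append(
                f"{count} {kind} references exceed the limit of {MAX_REFERENCES[kind]}"
            )
    return problems


if __name__ == "__main__":
    request = GenerationRequest(
        prompt="A street musician playing at dusk",
        duration_s=10,
        image_refs=["scene.png"],
        audio_refs=["melody.wav"],
    )
    print(validate(request) or "request fits the published limits")
```

Checking limits on the client side is a design convenience rather than a requirement: it surfaces an oversized reference set or an unsupported resolution before any upload or generation time is spent.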

Why does it matter?

Most current AI video tools either skip audio entirely or add it as a post-processing step requiring manual alignment. Native joint generation means the model learns the relationship between motion and sound during training, producing output where audio does not feel bolted on. For video production workflows, this reduces the number of separate tools in a pipeline and opens use cases like generating a scene from a reference audio track plus a reference image in a single step.

Who is it for?

Seedance 2.0 is aimed at video creators and at developers building video generation pipelines who need synchronized audio and video without a post-production step.

Try it

fal.ai/seedance-2.0