Audio-Sync Video Generation with Multi-Stream Temporal Control

Shuchen Weng1†, Haojie Zheng1,2†, Chang Zheng3, Si Li3, Boxin Shi2‡, Xinlong Wang1‡
1BAAI   2PKU   3BUPT
(† Equal contribution ‡ Corresponding author)

 

 

Visualization of subset division


Basic face

(8 video examples, each labeled Speech, Effects, and Music)

Single character

(8 video examples, each labeled Speech, Effects, and Music)

Multiple characters

(8 video examples, each labeled Speech, Effects, and Music)

Sound event

(8 video examples, each labeled Speech, Effects, and Music)

Visual mood

(8 video examples, each labeled Speech, Effects, and Music)

BibTeX

@article{MTV,
      title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
      author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
      journal={arXiv preprint arXiv:2506.08003},
      year={2025}
}