Audio-Sync Video Generation with Multi-Stream Temporal Control

Shuchen Weng1†, Haojie Zheng1,2†, Chang Zheng3, Si Li3, Boxin Shi2‡, Xinlong Wang1‡
1BAAI   2PKU   3BUPT
(† Equal contribution   ‡ Corresponding author)

Versatile capabilities

Character-centric narrative
... A man in a brown jacket and a blue shirt ... talking on a mobile phone ...
... The driver, who is wearing a brown leather jacket ... holding the steering ...
Multi-character interaction
... A male on the left, wearing a black suit ... A woman ... wearing a red dress ...
... A woman ... wearing a white blouse with brown hair ... A male ... black buttons ...
Sound-triggered events
... The focus is on the water's movement as it is poured into the glass from above ...
... A person is ascending a dimly lit staircase in a building ...
Music-shaped ambiance
... A young woman ... flowers with purple and yellow ... A man ... a dark jacket ...
... The driver, who is wearing a brown leather jacket ... holding the steering ...
Camera movement
... An old, rusted car driving on a suburban street ... with a faded white paint job ...
... A view through an ornate archway into a dimly lit space ... light shines down ...

Application

Character creation
Keyframe guidance
... Wolf wearing sunglasses and owl wearing suit ...
... A woman with blonde hair ... a red dress with white fur ...
Long video generation
Scene transitions
... A woman with shoulder-length dark hair ... wearing light-colored blazer ...
... A dark knight ... horse at a steady gallop across a vast grassland/snow-covered field ...

Controllability

Event timing
Lip motion
... A woman stands in a dimly lit corridor, bathed in a blue hue ...
... A man with a beard and long dark hair is performing on stage ...
Appearance
... short dark hair ... a red long-sleeved shirt ...
... middle dark hair ... a black long-sleeved shirt ...
... long dark hair ... a gray long-sleeved shirt ...
Visual mood
... A group of birds is present on a sandy beach with gentle waves ...

Comparisons with state-of-the-art methods

... The man has gray hair ... The woman has long black hair, is wearing a denim shirt ...
Ours
MM-Diffusion
TempoTokens
Xing et al.
... A person stands on a stage with a guitar, illuminated by red and purple stage lights ...
Ours
MM-Diffusion
TempoTokens
Xing et al.

BibTeX

@article{MTV,
      title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
      author={Weng, Shuchen and Zheng, Haojie and Zheng, Chang and Li, Si and Shi, Boxin and Wang, Xinlong},
      journal={arXiv preprint arXiv:2506.08003},
      year={2025}
}