Audio-sync Video Instance Editing with
Granularity-Aware Mask Refiner

Haojie Zheng1,2†     Shuchen Weng2,1†     Jingqi Liu1,2     Siqi Yang2
Boxin Shi1‡     Xinlong Wang2‡
1Peking University    2Beijing Academy of Artificial Intelligence
† Equal contribution      ‡ Corresponding author

(For the best experience, please enable the audio)

1. Representative application scenarios

(a) Speech Modification
... A woman with blonde hair, wearing a blue coat ... says, "No, please. When the latkes are frying, the kitchen should be closed." ...
Input
Mask
Result
(b) Appearance Alteration
... A young man wearing a flat cap, dark shirt, and a camel-colored coat is walking along a sidewalk ...
Input
Mask
Result
(c) Semantic Category Transformation
... A white cat with a wistful expression is sitting on the floor ... letting out a soft "meow" ...
Input
Mask
Result
(d) Audio-Driven Dynamics Adjustment
... A warmly lit, domestic bathroom ... a close-up shot of a faucet, with water flowing ...
Input
Mask
Result

2. Additional application scenarios

(a) Instance Insertion
... A vintage station wagon with a two-tone paint job, primarily red and white ... enters the scene from the right ...
Input
Mask
Result
(b) Instance Removal
... Restore the masked region naturally, keeping the original scene consistent ... preserving the style ... of the surrounding content ...
Input
Mask
Result
(c) Audio-Sync Video Generation
... A man is riding on horseback ... wearing a red jacket and a white shirt ... the camera follows the riders from a side perspective ...
Input
Input Image
Mask
Mask Image
Result
(d) Long-Duration Video Editing
... A white woman with shoulder-length black hair ... wearing a white long-sleeved sweater ...
Input
Mask
Result

3. Optional control contexts

(a) Scribble Control
... A woman wears a burgundy coat over a grey shirt ... a pendant hanging ... says, "I really admire his books, and he is famous." ...
Input
Mask
Scribble
Reference
Result
(b) Pose Control
... Two women are engaged in a conversation ... the woman on the left, with long hair and wearing a blue top ...
Input
Mask
Pose
Reference
Result
(c) Reference Control
... A black jeep begins to drive away from its parking spot ... the camera follows its movement, catching the car as it eases onto the street ...
Input
Mask
Reference
Reference
Result

4. Comparison with state-of-the-art methods

... A man wearing a blue jacket with a red inner shirt whose collar is folded outward ... says, "They erase me, but I remember everything." ...
Input
Mask
AVI-Edit (Ours)
AvED
Ovi
VACE-Foley
... A vintage car is driving along a winding road with hedges and trees ... it is colored dark with a shiny exterior ...
Input
Mask
AVI-Edit (Ours)
AvED
Ovi
VACE-Foley

Citation

If you find this work useful for your research, please consider citing our paper:

@article{avi-edit,
  title={Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner},
  author={Zheng, Haojie and Weng, Shuchen and Liu, Jingqi and Yang, Siqi and Shi, Boxin and Wang, Xinlong},
  journal={arXiv preprint arXiv:2512.10571},
  year={2025}
}