InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
1 Beijing Academy of Artificial Intelligence 2 Peking University
* Corresponding authors
Demo
(For the best experience, please enable the audio and wear headphones)
Additional Application Scenarios
Instruction:
Keep the person's appearance, change the timbre to a man, and change the spoken words to "I understand, but I think we need to consider."
Input
Output
Instruction:
Keep the spoken content, and change the man to a woman dressed in a dark blazer over a red sweater.
Input
Output
Instruction:
Keep the timbre, change the woman to a red-haired woman wearing a white shirt, and change the spoken words to "I came here to tell you that you should to go."
Input
Output
Comparison with State-of-the-Art Methods
Instruction:
Change the man into a woman with long dark hair, wearing a red and black checkered shirt, and saying, "Wait, what? That can't be right... Oh no, did I just miss the deadline now?"
Input
AvED
CoherentAVEdit
AVI-Edit
InstructAV2AV
Instruction:
Make the horse dark brown with a white saddle.
Input
AvED
CoherentAVEdit
AVI-Edit
InstructAV2AV
Citation
@article{InstructAV2AV,
  title={InstructAV2AV: Instruction-Guided Audio-Video Joint Editing},
  author={Zheng, Haojie and Yang, Yixin and Yang, Siqi and Weng, Shuchen and Shi, Boxin},
  journal={arXiv preprint arXiv:2605.18467},
  year={2026}
}