InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

Haojie Zheng1,2, Yixin Yang2, Siqi Yang2, Shuchen Weng1,2*, Boxin Shi2*

1 Beijing Academy of Artificial Intelligence     2 Peking University

* Corresponding authors

Demo

(For the best experience, please enable the audio and wear headphones)

Representative Application Scenarios

Instruction: Keep the person’s identity and change the spoken words to "This is more than just art, it’s a statement."

Input

Output

Instruction: Change the man into a young woman with brown hair, wearing a gray blazer over a light pink top and a necklace with a heart-shaped pendant, and saying, "I really think we should give it another chance."

Input

Output

Instruction: Add a dark vintage sedan driving from the right to the left.

Input

Output

Instruction: Remove the chipmunk standing on the stone surface among the peanuts.

Input

Output

Additional Application Scenarios

Instruction: Keep the person’s appearance, change the timbre to a man, and change the spoken words to "I understand, but I think we need to consider."

Input

Output

Instruction: Keep the spoken content, and change the man to a woman dressed in a dark blazer over a red sweater.

Input

Output

Instruction: Keep the timbre, change the woman to a red-haired woman wearing a white shirt, and change the spoken words to "I came here to tell you that you should to go."

Input

Output

Comparison with State-of-the-Art Methods

Instruction: Change the man into a woman with long dark hair, wearing a red and black checkered shirt, and saying, "Wait, what? That can't be right... Oh no, did I just miss the deadline now?"

Input
AvED
CoherentAVEdit
AVI-Edit
InstructAV2AV

Instruction: Make the horse dark brown with a white saddle.

Input
AvED
CoherentAVEdit
AVI-Edit
InstructAV2AV

Citation

@article{InstructAV2AV,
      title={InstructAV2AV: Instruction-Guided Audio-Video Joint Editing},
      author={Zheng, Haojie and Yang, Yixin and Yang, Siqi and Weng, Shuchen and Shi, Boxin},
      journal={arXiv preprint arXiv:2605.18467},
      year={2026}
}