OmniShow

Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou1,* Guisheng Liu2,* Hao Yang2 Jiatong Li2,† Jingyu Lin3 Xiaohu Huang4 Yichen Liu2 Xin Gao2 Cunjian Chen3 Shilei Wen2,§ Chi-Wing Fu1 Pheng-Ann Heng1,§
1The Chinese University of Hong Kong 2ByteDance 3Monash University 4The University of Hong Kong
*Equal contribution †Project lead §Corresponding author

TL;DR: We study Human-Object Interaction Video Generation (HOIVG) and present OmniShow, an end-to-end framework that unifies text, reference image, audio, and pose conditions to synthesize high-quality HOI videos.

Reference-to-Video Generation (R2V)

With reference image injection, OmniShow achieves higher-fidelity appearance and more natural interaction than HunyuanCustom, HuMo-17B, VACE, and Phantom-14B.

OmniShow (Ours)
HunyuanCustom
HuMo-17B
VACE
Phantom-14B

Reference+Audio-to-Video Generation (RA2V)

With audio input involved, OmniShow preserves reference identity and aligns motion to audio more reliably than HunyuanCustom and HuMo-17B.

OmniShow (Ours)
HunyuanCustom
HuMo-17B

Reference+Pose-to-Video Generation (RP2V)

Given reference images and a pose sequence, OmniShow follows motion trajectories more closely than AnchorCrafter and VACE while keeping object interactions authentic.

OmniShow (Ours)
AnchorCrafter
VACE

Reference+Audio+Pose-to-Video Generation (RAP2V)

OmniShow uniquely supports joint text+reference+audio+pose input and achieves stable generation with precise condition alignment.

OmniShow (Ours)
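
The four settings shown on this page differ only in which optional conditions accompany the reference images. The sketch below illustrates that taxonomy. It is not OmniShow's actual interface: the names `Conditions` and `task_name` are hypothetical, and it assumes a text prompt may accompany every setting.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Conditions:
    """Hypothetical container for the inputs of one generation request."""
    reference_images: Tuple[str, ...]  # e.g. paths to human/object reference images
    text: str = ""                     # assumed: an optional text prompt in every setting
    audio: Optional[str] = None        # e.g. path to a speech or singing clip
    pose: Optional[str] = None         # e.g. path to a pose sequence


def task_name(c: Conditions) -> str:
    """Map the supplied conditions to the task abbreviation used on this page."""
    if not c.reference_images:
        raise ValueError("each of the four settings needs at least one reference image")
    # "A" is appended when audio is given, "P" when pose is given.
    suffix = ("A" if c.audio is not None else "") + ("P" if c.pose is not None else "")
    return f"R{suffix}2V"
```

For example, a request with reference images, audio, and pose maps to `RAP2V`, while one with reference images alone maps to `R2V`.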

More Features

More capabilities enabled by OmniShow’s multimodal conditioning.

Lifelike Motion Quality
Smooth motion with rich and coherent dynamics.
Robust Physical Plausibility
More stable contact and grasping, with fewer penetrations.
Native Long-Shot Generation
Generate longer continuous shots, up to 10 seconds.
Expressive Avatar Animation
Vivid talking and singing from a human image and audio input.
Stable Identity Preservation
Maintain highly consistent character appearance across diverse scenarios. (These videos are generated for research purposes only.)