Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement


Runyi Yu1 Tianyu He Ailing Zhang1 Yuchi Wang1 Junliang Guo Xu Tan Chang Liu3 Jie Chen1,2 Jiang Bian

1Peking University 2Peng Cheng Laboratory 3Tsinghua University

Paper arXiv

Lip Sync on AI-Generated Characters


Lip Sync on Real Characters


Lip Sync from Different Languages (with Speech Translation)

Appearance Controllable Lip Sync


Abstract

We aim to edit the lip movements in a talking video according to the given speech while preserving personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis.

Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual detail preservation. Instead, we propose to disentangle motion from appearance and then generate them sequentially with a speech-to-motion diffusion model and a motion-conditioned appearance generation model.
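
To make the decomposition concrete, below is a minimal PyTorch sketch of the two-stage inference flow. The module names (MotionDenoiser, speech_to_motion, motion_to_video), the feature dimensions, and the simplified denoising loop are illustrative assumptions, not the actual MyTalk implementation.

import torch
import torch.nn as nn

# Stand-in denoiser for stage 1; the real model would be a far larger network.
class MotionDenoiser(nn.Module):
    def __init__(self, audio_dim=80, lmk_dim=262, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + lmk_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, lmk_dim),
        )

    def forward(self, noisy_lmk, audio, t):
        # Condition on speech features and a scalar diffusion timestep.
        t_emb = t.view(-1, 1, 1).expand(noisy_lmk.shape[0], noisy_lmk.shape[1], 1)
        return self.net(torch.cat([noisy_lmk, audio, t_emb], dim=-1))

@torch.no_grad()
def speech_to_motion(denoiser, audio, steps=10):
    """Stage 1: iteratively denoise a landmark sequence conditioned on speech."""
    lmk = torch.randn(audio.shape[0], audio.shape[1], 262)
    for step in reversed(range(steps)):
        t = torch.full((audio.shape[0],), step / steps)
        lmk = denoiser(lmk, audio, t)  # simplified: re-predict the clean sequence
    return lmk

def motion_to_video(lmk, lip_ref, nonlip_ref):
    """Stage 2 (placeholder): render frames from motion plus appearance references."""
    return torch.rand(lmk.shape[0], lmk.shape[1], 3, 256, 256)

audio  = torch.randn(1, 25, 80)                      # speech features for 25 frames
motion = speech_to_motion(MotionDenoiser(), audio)   # (1, 25, 262) landmark sequence
video  = motion_to_video(motion,
                         torch.rand(1, 3, 256, 256), # lip appearance reference
                         torch.rand(1, 3, 256, 256)) # non-lip appearance reference
print(motion.shape, video.shape)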

However, challenges remain in each stage, such as motion-aware identity preservation in (1) and visual detail preservation in (2). To preserve personal identity, we adopt landmarks to represent the motion and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders for the lip appearance, the non-lip appearance, and the motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset.
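
As an illustration of the landmark-based identity loss, here is a hedged sketch: the extractor architecture, the embedding size, and the cosine-similarity formulation are assumptions made for exposition and may differ from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkIdentityExtractor(nn.Module):
    """Maps a landmark sequence to a person-specific embedding (assumed architecture)."""
    def __init__(self, lmk_dim=262, id_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(lmk_dim, 256), nn.ReLU(), nn.Linear(256, id_dim))

    def forward(self, lmk_seq):               # (B, T, lmk_dim)
        return self.mlp(lmk_seq).mean(dim=1)  # temporal pooling -> (B, id_dim)

def identity_loss(extractor, generated_lmk, reference_lmk):
    """Pull the identity embedding of generated landmarks toward that of the reference."""
    id_gen = extractor(generated_lmk)
    id_ref = extractor(reference_lmk)
    return 1.0 - F.cosine_similarity(id_gen, id_ref, dim=-1).mean()

extractor = LandmarkIdentityExtractor()
loss = identity_loss(extractor, torch.randn(2, 25, 262), torch.randn(2, 25, 262))
print(loss.item())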

Experiments show that our method generalizes well to unseen and even out-of-domain persons, in terms of both lip sync and visual detail preservation.

Method

Our proposed MyTalk adopts a motion-appearance-disentangled two-stage framework for talking-video lip sync. (a) In the first stage, a speech-driven motion generation model produces motion (i.e., landmark) sequences from the input speech with a diffusion model. (b) To better preserve the motion identity, we design an identity extractor and a corresponding identity loss in the motion generation model. (c) In the second stage, separate encoders encode the motion-agnostic lip appearance, the non-lip appearance, and the generated motion. The encoded representations are fused by a FusionNet and decoded into the output video.
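
The second stage can be pictured as three separate encoders whose outputs are fused and decoded, expanding the stage-2 placeholder sketched earlier. The MLP encoders, the concatenation-based FusionNet, and the tensor shapes below are simplifying assumptions rather than the released architecture.

import torch
import torch.nn as nn

class Stage2Renderer(nn.Module):
    """Sketch of stage 2: separate lip / non-lip / motion encoders, FusionNet, decoder."""
    def __init__(self, lmk_dim=262, feat=128, frame_hw=64):
        super().__init__()
        px = 3 * frame_hw * frame_hw
        self.frame_hw   = frame_hw
        self.enc_lip    = nn.Linear(px, feat)        # lip appearance encoder
        self.enc_nonlip = nn.Linear(px, feat)        # non-lip appearance encoder
        self.enc_motion = nn.Linear(lmk_dim, feat)   # motion (landmark) encoder
        self.fusion     = nn.Sequential(nn.Linear(3 * feat, feat), nn.ReLU())  # "FusionNet"
        self.decoder    = nn.Linear(feat, px)        # decodes fused features to pixels

    def forward(self, lip_ref, nonlip_ref, lmk):
        b, t, _  = lmk.shape
        f_lip    = self.enc_lip(lip_ref.flatten(1)).unsqueeze(1).expand(b, t, -1)
        f_nonlip = self.enc_nonlip(nonlip_ref.flatten(1)).unsqueeze(1).expand(b, t, -1)
        f_motion = self.enc_motion(lmk)
        fused    = self.fusion(torch.cat([f_lip, f_nonlip, f_motion], dim=-1))
        return self.decoder(fused).view(b, t, 3, self.frame_hw, self.frame_hw)

frames = Stage2Renderer()(torch.rand(1, 3, 64, 64),   # lip reference
                          torch.rand(1, 3, 64, 64),   # non-lip reference
                          torch.randn(1, 25, 262))    # generated landmarks
print(frames.shape)  # torch.Size([1, 25, 3, 64, 64])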

Quality Comparison

Appearance Controllable Lip Sync Generation

Benefiting from our disentangled modeling, MyTalk accurately integrates the appearance and motion conditions into the generated videos as separate factors, which enables us to edit the lip-region appearance by providing the model with a variety of reference images.
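
Conceptually, appearance control amounts to keeping the motion and the non-lip reference fixed while swapping the lip reference. The render function below is a hypothetical stand-in for the stage-2 model, used only to illustrate the interface.

import torch

def render(motion, lip_ref, nonlip_ref):
    # Placeholder for the stage-2 renderer; a real model fuses these with learned encoders.
    return torch.rand(motion.shape[0], motion.shape[1], 3, 64, 64)

motion     = torch.randn(1, 25, 262)                       # landmarks generated in stage 1
nonlip_ref = torch.rand(1, 3, 64, 64)                      # fixes identity, pose, background
lip_refs   = [torch.rand(1, 3, 64, 64) for _ in range(3)]  # different lip-region styles

# Same speech-driven motion, three different lip appearances.
videos = [render(motion, lip, nonlip_ref) for lip in lip_refs]
print(len(videos), videos[0].shape)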

Emotion Controllable Lip Sync Generation

In addition, we can control the emotion of the talking video by using an emotion reference to guide both motion and appearance generation.
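
One hypothetical way to realize such guidance is to append an emotion embedding (e.g., pooled from a reference clip by an assumed emotion encoder) to the conditioning of both stages; the embedding size and concatenation scheme below are assumptions, not the paper's design.

import torch

emotion = torch.randn(1, 1, 32)                      # assumed emotion embedding (dim 32)
audio   = torch.randn(1, 25, 80)                     # speech features for 25 frames
appearance_feat = torch.randn(1, 25, 128)            # assumed fused appearance features

# Append the emotion embedding to the conditioning of both stages.
motion_cond     = torch.cat([audio, emotion.expand(-1, 25, -1)], dim=-1)            # stage 1
appearance_cond = torch.cat([appearance_feat, emotion.expand(-1, 25, -1)], dim=-1)  # stage 2
print(motion_cond.shape, appearance_cond.shape)      # (1, 25, 112), (1, 25, 160)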

Risks and responsible AI considerations

MyTalk focuses on generating lip sync for avatars, with the goal of enabling positive applications such as language interpretation and talking AI-generated avatars. We are committed to developing AI responsibly to improve human welfare, while remaining aware of the potential for misuse, such as impersonation.

To ensure our technology is used ethically and complies with regulations, we will not release APIs, products, or detailed implementations until we are certain of responsible usage.


Citation

@misc{yu2024make,
  title={Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement},
  author={Runyi Yu and Tianyu He and Ailing Zhang and Yuchi Wang and Junliang Guo and Xu Tan and Chang Liu and Jie Chen and Jiang Bian},
  year={2024},
  eprint={2406.08096},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}