Audio-driven talking face generation with diverse yet realistic facial animations

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/19275

Title:

Audio-driven talking face generation with diverse yet realistic facial animations

Journal Title:

Pattern Recognition

DOI:

10.1016/j.patcog.2023.109865

Publication URL:

http://dx.doi.org/10.1016/j.patcog.2023.109865

Authors:

Rongliang Wu, Yingchen Yu, Fangneng Zhan, Jiahui Zhang, Xiaoqin Zhang, Shijian Lu

Keywords:

Software, Computer Vision and Pattern Recognition, artificial intelligence, Signal processing

Publication Date:

04 August 2023

Citation:

Wu, R., Yu, Y., Zhan, F., Zhang, J., Zhang, X., & Lu, S. (2023). Audio-driven talking face generation with diverse yet realistic facial animations. Pattern Recognition, 144, 109865. https://doi.org/10.1016/j.patcog.2023.109865

Abstract:

Audio-driven talking face generation, which aims to synthesize talking faces with realistic facial animations (including accurate lip movements, vivid facial expression details and natural head poses) corresponding to the audio, has achieved rapid progress in recent years. However, most existing work focuses on generating lip movements only, without handling the closely correlated facial expressions, which greatly degrades the realism of the generated faces. This paper presents DIRFA, a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio. To accommodate the fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network that can model the variational facial animation distribution conditioned on the input audio and autoregressively convert the audio signals into a facial animation sequence. In addition, we introduce a temporally-biased mask into the mapping network, which allows it to model the temporal dependency of facial animations and produce a temporally smooth facial animation sequence. With the generated facial animation sequence and a source image, photo-realistic talking faces can be synthesized with a generic generation network. Extensive experiments show that DIRFA can generate talking faces with realistic facial animations effectively.
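
For illustration only, the idea of a temporally-biased mask can be pictured as an additive bias on attention scores that hides future frames and down-weights distant past frames. This record does not give the paper's actual formulation, so the sketch below is an assumption throughout: the linear distance penalty, the `slope` parameter, and the function names `temporally_biased_mask` and `attention_weights` are hypothetical and are not taken from DIRFA.

```python
import math


def temporally_biased_mask(seq_len, slope=0.5):
    """Build an additive attention bias (assumed form, not the paper's).

    Entry [i][j] is -inf when frame j lies in the future of frame i,
    and -slope * (i - j) otherwise, so nearer past frames are favoured.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j > i:
                row.append(float("-inf"))  # future frames are hidden
            else:
                row.append(-slope * (i - j))  # older frames are penalised
        mask.append(row)
    return mask


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, mask):
    """Add the bias to raw attention scores, then normalise each row."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, mask)
    ]


# With uniform raw scores, the weights reflect the bias alone.
seq_len = 4
scores = [[0.0] * seq_len for _ in range(seq_len)]
weights = attention_weights(scores, temporally_biased_mask(seq_len))
```

With uniform scores, each frame attends most strongly to itself and progressively less to earlier frames, which is one simple way a mask can encourage temporally smooth output in an autoregressive model.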

License type:

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Funding Info:

This research/project is supported by the Ministry of Education, Singapore - Tier-1
Grant Reference no.: RG94/20

URI:

https://oar.a-star.edu.sg/communities-collections/articles/19275

ISSN:

0031-3203

Collections:

Institute for Infocomm Research

Manuscripts in This Item:

File	Size	Format
dirfa-oars.pdf	1.08 MB	PDF