General Pipeline of Talking Face Video Generation From Noisy Speech
(The noisy speech input first passes through the denoiser module. The denoised speech, together with a target head video, is then fed into the talking-head module, which outputs a video of the target person speaking the input audio.)
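At a high level, the pipeline is a two-stage composition: denoise the audio, then drive the talking-head model with the cleaned speech and the target video. A minimal sketch of that wiring (function and argument names are our own placeholders, not from the project code):

```python
def generate_talking_video(noisy_audio, target_head_video, denoiser, talking_head):
    """Chain the two pipeline stages described above.

    denoiser:      maps noisy speech -> denoised speech
    talking_head:  maps (denoised speech, target head video) -> output video
    """
    clean_audio = denoiser(noisy_audio)
    return talking_head(clean_audio, target_head_video)
```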



Comparison of lip-sync results on Overlapping Speech (the 4th is from our model)

Comparison of lip-sync results on Environmental Noise (the 4th is from our model)




Enhancing Voice-driven Face Generation with Audio Denoising

from IDL Team Project


Our study addresses the challenge of generating realistic facial animations from voice inputs, particularly in environments with variable noise levels and types.

We propose a novel approach that couples a multi-layer convolutional encoder-decoder denoising module, equipped with U-Net-like skip connections, with a NeRF-based talking face generation model. The combined system processes noisy audio inputs and generates synchronized facial expressions and lip movements. Compared with the baseline model, our approach achieves a notable improvement in SyncNet Confidence Score, a metric that assesses the alignment between speech audio and the corresponding video. This demonstrates the model's ability to produce more realistic and dependable facial animations from noisy inputs, with promising applications in real-life scenarios such as virtual reality and assistive technologies.
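The denoising module described above can be sketched as a small 1-D convolutional encoder-decoder with U-Net-like skip connections. This is an illustrative PyTorch sketch, not the project's actual architecture: the class name, number of levels, channel widths, and kernel sizes are all placeholder choices.

```python
import torch
import torch.nn as nn

class DenoiserUNet1D(nn.Module):
    """Illustrative 2-level conv encoder-decoder denoiser with a skip connection."""

    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions halve the time dimension at each level.
        self.enc1 = nn.Sequential(nn.Conv1d(1, 16, kernel_size=4, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU())
        # Decoder: transposed convolutions upsample back to the input length.
        self.dec2 = nn.Sequential(nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU())
        # The last decoder layer takes 32 channels because the matching
        # encoder activation (16 ch) is concatenated with the decoder
        # activation (16 ch) -- the U-Net-like skip connection.
        self.dec1 = nn.ConvTranspose1d(32, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                            # (B, 16, T/2)
        e2 = self.enc2(e1)                           # (B, 32, T/4)
        d2 = self.dec2(e2)                           # (B, 16, T/2)
        return self.dec1(torch.cat([d2, e1], dim=1))  # (B, 1, T)
```

The skip connection carries fine-grained temporal detail from the encoder directly to the decoder, which is what lets such models reconstruct clean speech without blurring transients. Usage: for an input waveform batch of shape `(B, 1, T)` with `T` divisible by 4, the output has the same shape.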

Full Report: link