Our study addresses the challenge of generating realistic facial animations from voice inputs, particularly in environments with variable noise levels and types.
We propose a novel approach that combines a multi-layer convolutional encoder-decoder denoising module, equipped with U-Net-like skip connections, with a NeRF-based talking face generation model, enabling noisy audio inputs to be processed into synchronized facial expressions and lip movements. Our method achieves a notable improvement over the baseline model in the SyncNet Confidence Score, a metric that assesses the alignment between speech audio and the corresponding video, demonstrating its ability to produce more realistic and dependable facial animations from noisy inputs. These improvements hold promise for real-world applications such as virtual reality and assistive technologies.
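To make the denoising module concrete, the sketch below shows a minimal convolutional encoder-decoder with U-Net-like skip connections in PyTorch. This is an illustrative assumption, not the authors' released code: the choice of 1-D convolutions, the layer counts, channel widths, and the `AudioDenoiser` name are all hypothetical, and the actual module's output would feed the NeRF-based talking face generator rather than stand alone.

```python
# A minimal sketch of a convolutional encoder-decoder denoiser with
# U-Net-like skip connections. All layer sizes and the 1-D convolution
# choice are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class AudioDenoiser(nn.Module):
    def __init__(self, in_ch=1, base_ch=32):
        super().__init__()
        # Encoder: each stage is followed by a 2x temporal downsampling.
        self.enc1 = self._block(in_ch, base_ch)
        self.enc2 = self._block(base_ch, base_ch * 2)
        self.enc3 = self._block(base_ch * 2, base_ch * 4)
        self.down = nn.MaxPool1d(2)
        # Decoder: upsample, then fuse the matching encoder features
        # through skip connections (channel-wise concatenation).
        self.up2 = nn.ConvTranspose1d(base_ch * 4, base_ch * 2, 2, stride=2)
        self.dec2 = self._block(base_ch * 4, base_ch * 2)
        self.up1 = nn.ConvTranspose1d(base_ch * 2, base_ch, 2, stride=2)
        self.dec1 = self._block(base_ch * 2, base_ch)
        self.out = nn.Conv1d(base_ch, in_ch, kernel_size=1)

    @staticmethod
    def _block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        e1 = self.enc1(x)              # full temporal resolution
        e2 = self.enc2(self.down(e1))  # 1/2 resolution
        e3 = self.enc3(self.down(e2))  # 1/4 resolution (bottleneck)
        # Skip connections concatenate encoder features at each scale.
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)            # denoised audio signal/features


# Usage: denoise a batch of 8 mono waveforms of 16384 samples each.
noisy = torch.randn(8, 1, 16384)
clean_estimate = AudioDenoiser()(noisy)
print(clean_estimate.shape)  # torch.Size([8, 1, 16384])
```

The skip connections let fine-grained temporal detail bypass the bottleneck, which is what allows such a module to suppress noise without smearing the speech cues that downstream lip synchronization depends on.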