We propose a novel approach for converting input speech audio into a photo-realistic speaking video of a specific person, in which the output video exhibits synchronized, realistic, and richly expressive body dynamics. Our system first generates 3D skeleton movements from the audio using a recurrent neural network (RNN), and then synthesizes the output video with a conditional generative adversarial network (GAN).
ACCV 2020
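
To make the data flow of the two-stage pipeline concrete, the following is a minimal sketch of the first stage only, under stated assumptions. It is not the architecture used in this work: the class `SkeletonRNN`, the helper `speech_to_video`, the plain Elman-style recurrence, the layer sizes, and the random untrained weights are all hypothetical and chosen for illustration. The second stage is represented by a caller-supplied `render_frame` function standing in for a trained conditional GAN generator.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix as a list of rows (untrained, illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SkeletonRNN:
    """Toy Elman RNN mapping audio feature frames to 3D joint positions.

    All sizes and weights are hypothetical; a real model would be trained
    on paired audio and skeleton data.
    """

    def __init__(self, n_audio, n_hidden, n_joints, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.n_joints = n_joints
        self.w_in = make_matrix(n_hidden, n_audio, rng)
        self.w_rec = make_matrix(n_hidden, n_hidden, rng)
        self.w_out = make_matrix(n_joints * 3, n_hidden, rng)

    def forward(self, audio_frames):
        """Return one skeleton (a list of (x, y, z) joints) per audio frame."""
        hidden = [0.0] * self.n_hidden
        skeletons = []
        for frame in audio_frames:
            from_input = matvec(self.w_in, frame)
            from_state = matvec(self.w_rec, hidden)
            # The hidden state carries context from earlier audio frames.
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            flat = matvec(self.w_out, hidden)
            skeletons.append(
                [tuple(flat[3 * j:3 * j + 3]) for j in range(self.n_joints)]
            )
        return skeletons


def speech_to_video(audio_frames, skeleton_model, render_frame):
    """Stage 1: audio -> skeletons. Stage 2: skeleton -> frame via render_frame.

    render_frame is a placeholder for a conditional GAN generator.
    """
    skeletons = skeleton_model.forward(audio_frames)
    return [render_frame(skeleton) for skeleton in skeletons]
```

In this sketch each audio frame yields exactly one skeleton, and each skeleton yields exactly one video frame, so synchronization with the audio follows from the frame-by-frame structure. How the actual system aligns audio and video frame rates is not specified in the text above.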