We present a video diffusion transformer that's conditioned on audio input.
The model, which we name Infinity V2, can be used to generate extremely expressive talking-head style videos from a single source image and audio.
Because Infinity V2 is an end-to-end model, facial expressions and head movements in the generated video always reflect the natural cadence and emotion of the audio.
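This page doesn't go into architectural details, but as a rough illustration of what "conditioned on audio" can look like in a diffusion transformer, here is a minimal sketch of a single transformer block in which video tokens cross-attend to audio features. Every name, shape, and layer choice below is an illustrative assumption on our part, not the actual Infinity V2 design.

```python
# Minimal sketch of one way a diffusion transformer block can be conditioned on
# audio via cross-attention. This is NOT the Infinity V2 architecture (which is
# not described on this page); all names, shapes, and layer choices are
# illustrative assumptions.
import torch
import torch.nn as nn


class AudioConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Self-attention over the (noisy) spatio-temporal video tokens.
        x = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(x, x, x, need_weights=False)[0]
        # Cross-attention: video tokens attend to audio features, so lip motion,
        # expression, and head movement can follow the audio's cadence.
        x = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(x, audio_tokens, audio_tokens, need_weights=False)[0]
        return video_tokens + self.mlp(self.norm3(video_tokens))


if __name__ == "__main__":
    block = AudioConditionedDiTBlock()
    video = torch.randn(1, 256, 512)  # e.g. flattened latent video patches
    audio = torch.randn(1, 100, 512)  # e.g. projected audio-encoder features
    print(block(video, audio).shape)  # torch.Size([1, 256, 512])
```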
This page highlights some of the Infinity V2 model results so far, including what works well, what doesn't work, and comparisons to existing talking head avatars.
Try it here: https://studio.infinity.ai/try-inf2
Here are a few things that have worked surprisingly well with the V2 model.
Despite being trained on an English-centric dataset, the model is able to generalize to other languages.
The model also captures fine physical details surprisingly well: for example, it can generate dangling earrings, reflections on glasses, and dynamic shadows.
Despite being trained exclusively on real-world videos, the model still does well on cartoons, paintings, and sculptures.
When conditioned on music, the model generates singing characters. It can handle different types of music, including pop and rap.
Here are some things Infinity V2 cannot yet handle.
The model does not animate animals or inanimate objects with face-like features; it likely treats them as static parts of the background, which is distracting and annoying. The model also cannot handle source images in which hands are prominently featured.
The model can animate 2D and 3D cartoons, but it is less robust on them than on other image types. In particular, details of the facial structure, such as the eyes, often become distorted. Frankly, we were surprised cartoons work at all, since they are not present in our training data.
The generated face can also drift from the source identity; this is most noticeable on well-known identities or on images of yourself.
Talking avatar companies like HeyGen and Synthesia simply do lip sync on top of previously recorded videos. We know because we used the platforms and analyzed the videos.
[Left] Source video provided to HeyGen (partial segment; zoomed-in for clarity).
[Right] Synthetic AI video generated by HeyGen.
The benefit of lip syncing is that users can quickly generate entire scenes and body movements at high resolution (since the returned videos are just a dubbed segment of the originally recorded video). The drawback is that videos often look uncanny, since the actor's facial expressions and gestures don't actually reflect the emotion or cadence of what's being said. When analyzing the video outputs, we found that one interesting trick these companies use is to extend the source video with a mirrored (in time) version of itself. This allows users to create videos of any length (e.g. even if the provided video is only 1 minute long). However, it can also make the movements feel even more uncanny.
Source of videos: heygen.com
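For the curious, the time-mirroring trick described above is easy to sketch: append a time-reversed copy of the clip and repeat the resulting palindrome until the requested length is reached. The snippet below is our own illustrative sketch (the function name and frame shapes are assumptions, not anyone's actual implementation).

```python
# Sketch of the "mirror in time" extension trick: loop a short clip by
# appending a time-reversed copy, repeated until the requested length.
import numpy as np


def extend_by_mirroring(frames: np.ndarray, target_len: int) -> np.ndarray:
    """frames: array of shape (T, H, W, C). Returns (target_len, H, W, C)."""
    # A forward pass followed by a backward pass (dropping the repeated
    # endpoints) forms a seamless palindrome loop.
    cycle = np.concatenate([frames, frames[-2:0:-1]], axis=0)
    reps = int(np.ceil(target_len / len(cycle)))
    return np.concatenate([cycle] * reps, axis=0)[:target_len]


# e.g. a 1-minute source clip at 30 fps, extended to 3 minutes:
src = np.zeros((60 * 30, 64, 64, 3), dtype=np.uint8)
out = extend_by_mirroring(src, target_len=3 * 60 * 30)
print(out.shape)  # (5400, 64, 64, 3)
```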
Some test audio clips and images are from Echomimic, EMO, Loopy Avatar, and other projects. Thanks to these teams!