Less than a year ago, Microsoft’s VASA-1 blew my mind. The company showed how it could animate any photo and turn it into a video featuring the person in the image. This wasn’t the only impressive part, as the subject of the image would also be able to speak in the video.
VASA-1 surpassed anything we’d seen back then. This was April 2024, when we had already seen Sora, OpenAI’s text-to-video generation tool that would not be released until December. Sora did not feature similarly advanced face animation and audio synchronization technologies.
Unlike OpenAI, Microsoft never intended to make VASA-1 available to the public. I said at the time that a public tool like VASA-1 could cause harm, as anyone could create misleading videos of people saying whatever the creator conceives. Microsoft’s research project also indicated that it would only be a matter of time before others developed similar technology.
Now, TikTok parent company ByteDance has developed an AI tool called OmniHuman-1 that can replicate what VASA-1 did while taking things to a whole new level.
The Chinese company can take a single photo and turn it into a fully animated video. The subject in the image can speak in sync with the provided audio, similar to what the VASA-1 examples showed. But it gets crazier than that. OmniHuman-1 can also animate body part movements and gestures, as seen in the following examples.
The similarities to VASA-1 shouldn’t be surprising. The Chinese researchers mention on OmniHuman-1’s research page that they used VASA-1 as a template, and even took audio samples from Microsoft and other companies.
According to Business Standard, OmniHuman-1 uses multiple input sources simultaneously, including images, audio, text, and body poses. The result is a more precise and fluid motion synthesis.
ByteDance used 19,000 hours of video footage to train OmniHuman-1. That’s how they were able to teach the AI to create video sequences that are almost indiscernible from real footage. Some of the samples above are practically perfect. In others, it’s clear that we’re looking at AI-generated movement, especially around the subject’s mouth.
The Albert Einstein speech in the clip above is certainly a highlight for OmniHuman-1. Taylor Swift singing the theme song from the anime Naruto in Japanese in the video below is another example of OmniHuman-1 in action:
OmniHuman-1 can be used to create AI-generated videos showing human subjects (real or fabricated) speaking or singing in all sorts of scenarios. This opens the door to abuse, as I’m sure some people, including malicious actors, would use the service to impersonate celebrities for scams or other misleading purposes.
OmniHuman-1 also works well for animating cartoon and video game characters. This could be a great use for the technology, as it could help creators more accurately animate facial expressions and speech for such characters.
Also interesting is the claim that OmniHuman-1 can generate videos of unlimited length. The available examples range between five and 25 seconds. Memory is apparently the bottleneck, not the AI’s ability to create longer clips.
Business Standard points out that ByteDance’s OmniHuman-1 is an expected development from the Chinese company. ByteDance also recently unveiled INFP, an AI project aimed at animating facial expressions in conversations. ByteDance is also well-known for its CapCut editing app, which was removed from app stores alongside TikTok a few weeks ago.
It’s only natural to see ByteDance expand its AI video generation capabilities and introduce services like OmniHuman-1.
It’s unclear when OmniHuman-1 will be available to users, if ever. ByteDance has a website at this link where you can read more details about the AI research project and see more samples.
ByteDance researchers also mention “ethics concerns” in the document, which is great to see. It signals that ByteDance might take a more cautious approach to deploying the product, though I’m just speculating here.
But if OmniHuman-1 is released in the wild too soon, it’ll only be a matter of time before someone creates lifelike videos of real-life celebrities or made-up humans who say (or sing) anything the creator wants them to, in any language. And it won’t always be just for entertainment purposes.