The craziest AI development I’ve seen all year is Microsoft’s VASA-1 technology. The company developed AI models that can combine a single image of a person with an audio file to produce a moving video of that person speaking. The demos were mind-blowing, though VASA-1 isn’t available as a commercial product. It might never be, as this kind of AI tool is easy to abuse.
VASA-1 was shown off in mid-April. Now, almost two months later, Google DeepMind has unveiled a similar AI technology. It doesn’t have a commercial name; Google simply describes it as video-to-audio (V2A) technology. That also means it’s not a commercial AI product you can try out yourself.
V2A lets you generate audio from a single text prompt to match a silent video clip. Google’s demos are mind-blowing.
The video-to-audio tool “makes synchronized audiovisual generation possible,” as Google explains in a blog post. Google offered plenty of examples to showcase the V2A tech. Some of them are included below, complete with the prompts Google used to generate the audio for the videos.
Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete
“V2A combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action,” Google says, pointing out that V2A can be paired with Veo, the video generation model Google unveiled at I/O 2024. Veo is a direct competitor to OpenAI’s Sora and other similar products.
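Since V2A isn’t publicly available, there’s no real API to call. Still, a minimal sketch can illustrate the interface Google describes: video pixels plus an optional text prompt (and transcript) go in, a synchronized soundtrack comes out. Everything below, including the V2ARequest type and the generate_audio function, is a hypothetical illustration of that flow, not Google’s actual code.

```python
# Hypothetical sketch only: V2A has no public API. These names are
# illustrative assumptions modeling the interface Google's blog post
# describes -- a silent video and a text prompt in, audio out.
from dataclasses import dataclass


@dataclass
class V2ARequest:
    video_path: str       # silent input clip (the "video pixels")
    prompt: str           # natural-language description of the desired audio
    transcript: str = ""  # optional dialogue transcript for speech generation


def generate_audio(request: V2ARequest) -> bytes:
    """Placeholder for a V2A-style call.

    A real system would condition an audio-generation model on both the
    video frames and the text prompt, returning a waveform aligned to
    the on-screen action. This stub only models the interface, not the
    generation itself.
    """
    return b""  # stand-in for the generated soundtrack


# Example usage, mirroring one of Google's demo prompts:
audio = generate_audio(V2ARequest(
    video_path="drummer_clip.mp4",
    prompt="A drummer on a stage at a concert surrounded by "
           "flashing lights and a cheering crowd",
))
```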
Google says V2A tech can offer “a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video.” The tech can be used to make soundtracks, and Google highlights one especially exciting potential use: video-to-audio could add sound to silent films, which would be incredible.
Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd
However, voice generation isn’t perfect, as Google explains later in the blog post. While V2A won’t require you to manually align the audio and video, there are limitations, especially when it comes to speech:
We’re also improving lip synchronization for videos that involve speech. V2A attempts to generate speech from the input transcripts and synchronize it with characters’ lip movements. But the paired video generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.
Prompt for audio: Music, Transcript: “this turkey looks amazing, I’m so hungry”
Google also says it’s looking for feedback from the creative community on the video-to-audio tech to ensure V2A will have a positive impact. To prevent abuse, Google is adding its SynthID toolkit to the V2A research to watermark AI-generated content.
It’s unclear when V2A will be available to the public, with Google saying the new tech will undergo rigorous testing first. To see what’s possible with V2A at the current stage of development, you’ll find more demo clips in Google’s blog post.