Google DeepMind CEO Demis Hassabis revealed plans to integrate the tech giant’s Gemini AI models with its video-generating system, Veo.
The move, which he said is aimed at enhancing the models’ grasp of the real world, was disclosed in a recent episode of Possible, a podcast co-hosted by LinkedIn co-founder Reid Hoffman.
“We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” Hassabis explained. “We have a vision for a universal digital assistant — one that can actually help you in the real world.”
The announcement comes as major players in the AI industry increasingly pursue “omnimodal” systems — artificial intelligence capable of processing and generating multiple forms of media. Google’s latest Gemini models can now create images, audio, and text, while OpenAI’s ChatGPT also supports native image generation. Amazon, too, has teased the launch of an “any-to-any” model by year’s end.
These advancements rely on massive amounts of data from diverse media types, including text, audio, images, and video. According to Hassabis, Veo’s training data appears to come largely from YouTube — a platform owned by Google.
“Basically, by watching YouTube videos — a lot of YouTube videos — [Veo 2] can figure out, you know, the physics of the world,” he said.
Google has previously told TechCrunch that its AI models “may be” trained on “some” YouTube content, reportedly in accordance with its agreements with YouTube creators. The company is also believed to have updated its terms of service last year, in part to expand the scope of data it can use for AI training.