
Microsoft unveils three models for text, speech, images

Microsoft AI, the company’s research division, on Thursday unveiled three foundational artificial intelligence models capable of generating text, speech and images.

The move underscores Microsoft’s determination to expand its own multimodal AI capabilities and strengthen its position against competing AI labs, despite its ongoing partnership with OpenAI.

MAI-Transcribe-1 converts speech into text across 25 languages and is 2.5 times faster than Microsoft’s Azure Fast service, according to a company press release.

MAI-Voice-1 is an audio generation model that can produce up to 60 seconds of sound in just one second and enables users to create custom voices.

MAI-Image-2 is an image generation model.

The models were developed by Microsoft’s MAI Superintelligence team, an artificial intelligence research unit led by Mustafa Suleyman, the Chief Executive Officer of Microsoft AI.

The team was formed and officially announced in November 2025.

“At Microsoft AI, we’re building Humanist AI. We have a distinct view when creating our AI models — putting humans at the center, optimizing for how people actually communicate, training for practical use,” Suleyman wrote in a blog post announcing the models. “You’ll see more models from us soon in Foundry and directly in Microsoft products and experiences.”

Despite launching its own models, Suleyman reaffirmed Microsoft’s commitment to its partnership with OpenAI in an interview with VentureBeat.

Microsoft has invested more than $13 billion in the AI lab and integrates its models across a range of products under a multi-year partnership.

The company takes a similar approach with chips, developing some in-house while also sourcing from external suppliers.