
Microsoft VibeVoice Deep Dive: The Voice AI That Understands a Full Hour in One Shot
Originally published on NextFuture

What is VibeVoice?

Microsoft quietly dropped one of the most impressive open-source voice AI projects of 2025–2026: VibeVoice. It is a family of frontier-grade models that handle both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), and the engineering behind it is genuinely novel. VibeVoice is not a wrapper around Whisper. It is a ground-up rethink of how voice AI should work at scale.

The Family Tree

VibeVoice-TTS-1.5B: Long-form multi-speaker TTS (up to 90 min, 4 speakers). Accepted as an ICLR 2026 Oral. Code was temporarily removed due to misuse; community forks exist.

VibeVoice-Realtime-0.5B: Lightweight streaming TTS. First audio in ~300ms. Supports 9 languages + 11 English style voices. Now in HuggingFace Transformers v5.3.

VibeVoice-ASR-7B: 60-minute single-pass speech recognition with speaker diarization, timestamps, and multilingual support (50+ languages). The star of this article.

Core Innovation: Continuous Speech Tokeniz

