I Made a Single CUDA Kernel Speak: Streaming Qwen3-TTS at 50ms Latency on an RTX 5090

via Dev.to · Jayanth Kumar

My first measurement said 35,932 milliseconds. The target was 90. That's not a typo. Thirty-five seconds to produce the first chunk of audio from a text-to-speech system that was supposed to feel like a natural conversation. I was off by a factor of 400. And I had less than a day to fix it.

Here's how I went from knowing nothing about CUDA megakernels, nothing about TTS pipelines, and nothing about Pipecat to streaming real-time speech synthesis at 50ms TTFC and 0.17 RTF on a single RTX 5090. With 3 lines of kernel code changed.

Self Challenge

The task I set myself was deceptively simple on paper: use AlpinDale's qwen_megakernel, a ~1,200-line CUDA program that runs Qwen3-0.6B text generation at 1,000 tokens/second on an RTX 5090, and make it run Qwen3-TTS speech synthesis inside a Pipecat voice agent pipeline.

Two hard targets:

- TTFC (time to first audio chunk): < 90 ms. How long before the user hears anything.
- RTF (real-time factor): < 0.3. Generating 1 second of speech must take less tha
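The two targets above reduce to simple arithmetic, so a minimal sketch may help make them concrete. The helper names (`ttfc_ms`, `rtf`, `meets_targets`) are hypothetical and not taken from the article or from Pipecat; the only numbers used are the ones quoted in the text (90 ms, 0.3, 35,932 ms, 50 ms, 0.17).

```python
def ttfc_ms(request_time_s: float, first_chunk_time_s: float) -> float:
    """Time to first audio chunk, in milliseconds (hypothetical helper)."""
    return (first_chunk_time_s - request_time_s) * 1000.0


def rtf(generation_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced."""
    return generation_time_s / audio_duration_s


def meets_targets(ttfc: float, rtf_value: float,
                  ttfc_target_ms: float = 90.0,
                  rtf_target: float = 0.3) -> bool:
    """Check both hard targets quoted in the text: TTFC < 90 ms, RTF < 0.3."""
    return ttfc < ttfc_target_ms and rtf_value < rtf_target


# Figures quoted in the article: first measurement vs. final result.
first_attempt_ttfc = 35_932.0   # ms
final_ttfc = 50.0               # ms
final_rtf = 0.17                # 1 s of audio takes 0.17 s to generate

slowdown = first_attempt_ttfc / 90.0  # roughly the "factor of 400" in the text
print(f"first attempt was {slowdown:.0f}x over the TTFC target")
print("final result meets targets:", meets_targets(final_ttfc, final_rtf))
```

An RTF below 1.0 means audio is generated faster than it plays back; the 0.3 target leaves headroom for the rest of the voice pipeline.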

Continue reading on Dev.to
