Understanding TDT: The Mechanism Behind the Fastest Models on the Open ASR Leaderboard

By Ollie Parish, via Dev.to

TL;DR: The Token-and-Duration Transducer (TDT) extends RNN-T by jointly predicting which token to emit and how many encoder frames that token covers. This lets the model skip multiple frames per decoding step instead of advancing one at a time, yielding up to 2.82x faster decoding with comparable or better accuracy.

Word Error Rate (WER) is a useful metric to optimise, but if your model takes 10 seconds to transcribe 1 second of audio, nobody's shipping it. The Hugging Face Open ASR Leaderboard tracks both accuracy and speed. At the time of writing, NVIDIA's Parakeet TDT models lead the leaderboard's top 10 in RTFx (inverse real-time factor, i.e. how many seconds of audio the model can process per second of wall-clock time) by more than 3x over the nearest competition, while maintaining competitive WERs. The mechanism? A modification to the RNN-Transducer called the Token-and-Duration Transducer (TDT).
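To make the frame-skipping idea concrete, here is a minimal greedy-decoding sketch. It is not NVIDIA's implementation: `joint_fn` stands in for a trained joint network, and the lookup-table `mock_joint` is a hypothetical toy that emits fixed (token, duration) pairs. The point is only the control flow: each step predicts a token *and* a duration, and the decoder jumps ahead by that duration instead of advancing one frame at a time as plain RNN-T greedy decoding does.

```python
import numpy as np

def greedy_tdt_decode(joint_fn, num_frames, blank_id=0):
    """Greedy TDT decoding sketch.

    joint_fn(frame_index, prev_token) -> (token_logits, duration_logits)
    Each step predicts a token AND a duration, then skips `duration`
    encoder frames, so a long emission costs a single decoding step.
    """
    t, prev, tokens, steps = 0, blank_id, [], 0
    while t < num_frames:
        token_logits, dur_logits = joint_fn(t, prev)
        token = int(np.argmax(token_logits))
        duration = int(np.argmax(dur_logits))
        if token != blank_id:
            tokens.append(token)
            prev = token
        # Simplification: force at least one frame of progress. The real
        # TDT also allows duration 0, letting several tokens share a frame.
        t += max(duration, 1)
        steps += 1
    return tokens, steps

# Toy "joint network": a lookup table standing in for a trained model.
# Frame 0 emits token 1 covering 3 frames, frame 3 emits token 2 covering
# 2 frames, frame 5 emits a blank covering the final 3 frames.
_table = {0: (1, 3), 3: (2, 2), 5: (0, 3)}

def mock_joint(t, prev):
    tok, dur = _table.get(t, (0, 1))
    token_logits = np.eye(3)[tok]   # vocab: 0 = blank, 1, 2
    dur_logits = np.eye(5)[dur]     # durations 0..4
    return token_logits, dur_logits

tokens, steps = greedy_tdt_decode(mock_joint, num_frames=8)
```

With this toy input the decoder covers 8 encoder frames in 3 decoding steps; a standard RNN-T greedy loop advancing one frame per blank would need at least 8 steps, which is where the throughput advantage comes from.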

Continue reading on Dev.to
