
Claude Feels Slow. But Is Moving a Team to Open-Weight Models Actually the Fix?
TL;DR: Claude has a real speed problem for our team, but mostly in TTFT (time to first token), not in raw decoding speed. I measured our actual usage and found this:

- TTFT p50: 4.2s–6.8s
- TTFT p90: 14.5s–28.1s
- Claude Sonnet decode p50: 176 tok/s

That explains the feeling: Claude often isn't that slow once it starts, but sometimes it takes so long to begin that the whole thing feels like it's crawling.

That naturally raises the next question: should we move the team to self-hosted open-weight models? At first glance, it sounds promising. Self-hosted setups can have dramatically better TTFT: in the numbers I looked at, open-weight deployments were often estimated around 150–600ms TTFT, versus Claude's 4–7s median in our real usage. But once I looked at our actual team setup (10 engineers sharing one GPU budget), the answer stopped looking obvious. The best open-weight models need serious multi-GPU infrastructure, and once that infra is shared, the speed case starts looking surprisingly shaky. So this post is not…
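For context, here is a minimal sketch of how TTFT and decode speed can be pulled out of a streaming response. The `stream_tokens()` callable is a hypothetical stand-in for whatever streaming client you use (it just needs to yield text chunks as they arrive); the timing and percentile math are the part that matters.

```python
import time
import statistics
from typing import Callable, Iterable, Tuple

def measure_request(stream_tokens: Callable[[], Iterable[str]]) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tok_per_s) for one streamed request.

    `stream_tokens` is a hypothetical stand-in: any callable that yields
    text chunks as the model streams them (Anthropic SDK, vLLM, etc.).
    """
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0

    for _ in stream_tokens():
        now = time.perf_counter()
        if first_chunk_at is None:
            first_chunk_at = now  # TTFT ends when the first chunk arrives
        chunks += 1

    end = time.perf_counter()
    if first_chunk_at is None:
        return float("nan"), 0.0  # nothing streamed back

    ttft = first_chunk_at - start
    # Decode speed: chunks produced after the first one, over the decode window.
    # Chunk counts only approximate tokens; exact tok/s needs the API's token usage.
    decode = (chunks - 1) / (end - first_chunk_at) if chunks > 1 else 0.0
    return ttft, decode

def p50_p90(samples: list) -> Tuple[float, float]:
    """Percentiles over many requests (needs at least 2 samples)."""
    qs = statistics.quantiles(samples, n=10)  # deciles: qs[4] ~ p50, qs[8] ~ p90
    return qs[4], qs[8]
```

Logging one (ttft, decode) pair per request and taking p50/p90 over a week of real prompts is enough to see the pattern described above: decode speed looks healthy, and the tail lives almost entirely in TTFT.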
Continue reading on Dev.to



