I built the algorithm behind ChatGPT from scratch — here's what I learned


via Dev.to

By Tushar Singla | First Year BTech CSE (AI/ML) Student

"What I cannot create, I do not understand." — Richard Feynman

That quote hit different when I was staring at my screen at 2am, watching my tokenizer learn the word "the" by merge #17. Let me explain.

The origin story

Every time you type something into ChatGPT, Claude, or any LLM, something happens before the AI even sees your message. Your text gets tokenized.

"Hello, how are you?" → [15496, 11, 703, 389, 345, 30]

Those numbers are what the model actually sees. Not your words. And the thing doing this conversion? A tokenizer.

I wanted to understand exactly how it works. Not from a YouTube video. Not from a HuggingFace tutorial. I wanted to build one myself, from scratch, in pure Python.

So I did. Meet TewToken — a bilingual BPE tokenizer trained on English + Hindi text, built with zero ML libraries.

pip install git+https://github.com/tusharinqueue/tewtoken.git

Wait, what even is tokenization?

Computers do not understand language…
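To make the "merge #17" moment concrete, here is a minimal sketch of how BPE training works in pure Python: start from raw UTF-8 bytes, repeatedly find the most frequent adjacent pair of tokens, and merge it into a new token id. This is an illustrative toy (the function names `train_bpe`, `merge`, and `get_pair_counts` are my own, not TewToken's actual API):

```python
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent token pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # byte-level start: ids 0..255
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges
```

On a corpus full of English text, frequent byte pairs like `t`+`h` and then `th`+`e` tend to get merged early, which is exactly how a tokenizer "learns" the word "the" after a handful of merges.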

Continue reading on Dev.to
