FlareStart

Article • Machine Learning

Let's build the GPT Tokenizer

via Andrej Karpathy • 2y ago

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why, ideally, someone out there finds a way to delete this stage entirely.

Chapters:

  • 00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
  • 00:05:50 tokenization by example in a Web UI (tiktokenizer)
  • 00:14:56 strings in Python, Unicode code points
  • 00:18:15 Unicode byte encodin
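
The encode() and decode() functions mentioned in the description, together with Byte Pair Encoding training, can be sketched roughly as follows. This is a minimal illustration and not the lecture's own code: the helper names (get_pair_counts, merge, train) and the choice to start from raw UTF-8 bytes with new token ids beginning at 256 are assumptions made here, and the sketch leaves out the regex pre-splitting and special tokens that the production GPT tokenizers use.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order the rules were learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """String -> list of token ids, replaying merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """List of token ids -> string, by expanding each id back to bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")
```

With this sketch, training on a short string and then calling decode(encode(s, merges), merges) should return s unchanged for any text, because every string can fall back to its raw bytes when no merge rule applies.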

