
# Andrej Karpathy's microGPT Architecture — Complete Guide

## High-Level Overview

### 1. Data Loading and Preprocessing

The script begins by ensuring `input.txt` exists, defaulting to a dataset of names. Each line (name) is treated as an individual document, and the documents are shuffled so the model learns character patterns, not a fixed ordering.

```python
if not os.path.exists('input.txt'):
    # downloads names.txt
    ...
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
```

### 2. The Tokenizer — Text to Numbers

This is not a fancy library tokenizer. It finds every unique character in the text and uses that as the vocabulary.

```python
uchars = sorted(set(''.join(docs)))
BOS = len(uchars)  # Beginning-of-Sequence token (also acts as End-of-Sequence)
```

A special BOS token is added; it serves as both the start signal during generation and the stop signal when it is sampled as output.

Example: `"emma"` → `[BOS, e, m, m, a, BOS]` → `[26, 4, 12, 12, 0, 26]`

### 3. Embeddings — Numbers to Meaningful Vectors

Each token ID gets two 16-dimensional vectors: a token embedding and a positional embedding, which are summed to form the model's input.
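To make the tokenizer concrete, here is a minimal sketch of the vocabulary-plus-BOS scheme described above. The tiny `docs` list, the `stoi`/`itos` names, and the `encode`/`decode` helpers are assumptions for illustration, not the exact code from the script.

```python
# Stand-in for the names dataset (assumption for this sketch)
docs = ["emma", "olivia", "ava"]

uchars = sorted(set("".join(docs)))           # unique characters = vocabulary
BOS = len(uchars)                             # one extra id used as BOS/EOS
stoi = {ch: i for i, ch in enumerate(uchars)} # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(doc):
    """Wrap a document in BOS tokens and map each character to its id."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(ids):
    """Map ids back to characters, dropping the BOS/EOS markers."""
    return "".join(itos[i] for i in ids if i != BOS)

ids = encode("emma")
print(ids)          # → [7, 1, 4, 4, 0, 7]  (7 is BOS for this tiny vocabulary)
print(decode(ids))  # → emma
```

With the full names dataset the vocabulary would be the 26 lowercase letters, giving the `[26, 4, 12, 12, 0, 26]` encoding shown in the article.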
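The embedding step can be sketched in plain Python as two lookup tables whose rows are summed per position. The vocabulary size, context length, and initialization scale below are assumptions; only the 16-dimensional width comes from the article.

```python
import random

n_embd = 16      # embedding width, per the article
vocab_size = 27  # 26 characters + BOS (assumption)
block_size = 8   # max context length (assumption)

random.seed(0)
# Token embedding table: one 16-dim row per token id
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
# Positional embedding table: one 16-dim row per position
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(token_ids):
    """Each position's input vector = token embedding + position embedding."""
    return [[t + p for t, p in zip(wte[tok], wpe[pos])]
            for pos, tok in enumerate(token_ids)]

x = embed([26, 4, 12, 12, 0])  # e.g. [BOS, e, m, m, a]
print(len(x), len(x[0]))       # → 5 16  (5 tokens, each a 16-dim vector)
```

Training then adjusts both tables by gradient descent so that these vectors become useful to the rest of the network.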



