
Tokens
## 🔧 Tokens in Transformers: Developer Notes

### 🔹 What is a Token?

A token is the smallest unit of text that a transformer model processes. It is created by a tokenizer and then converted into numerical IDs before entering the model.

⚠️ Important: a token is **not** always a full word.

### 🔹 What Can Be a Token?

Depending on the tokenizer, a token may be:

- a whole word
- a subword (most common)
- a character
- punctuation
- a special symbol

✅ Modern transformers mainly use **subword tokenization**.

### 🔹 Example

Sentence: `I like eating apples`

Possible subword tokens: `[I] [like] [eat] [##ing] [apple] [##s]`

### 🔹 Transformer Processing Pipeline

Raw Text → Tokenizer → Tokens → Token IDs → Embeddings → Transformer

Neural networks only understand numbers, so tokens must be converted to IDs and then to vectors.

### 🔹 Why Tokenization Is Needed

Tokenization helps to:

- reduce vocabulary size
- handle unknown words
- capture morphology
- improve generalization
- enable efficient training

### 🔹 Special Tokens (Encoder Models)

Typical encoder input: [CLS] I li
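The subword split shown in the example (`eat` + `##ing`, `apple` + `##s`) can be sketched with a greedy longest-match-first algorithm, as used by BERT-style WordPiece tokenizers. This is a minimal illustration: the `VOCAB` set below is invented for the example sentence, not taken from any real model.

```python
# Toy WordPiece-style tokenizer. VOCAB is a made-up vocabulary for
# illustration only; real models ship vocabularies of ~30k+ entries.
VOCAB = {"i", "like", "eat", "##ing", "apple", "##s"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        prefix = "" if start == 0 else "##"  # "##" marks a word-internal piece
        while end > start:
            candidate = prefix + word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1  # no match: try a shorter prefix
        else:
            return ["[UNK]"]  # nothing matched: the whole word is unknown
        start = end
    return pieces

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens

print(tokenize("I like eating apples"))
# ['i', 'like', 'eat', '##ing', 'apple', '##s']
```

Because `eating` is absent from the vocabulary but `eat` and `##ing` are present, the greedy match falls back to the two pieces, which is exactly how subword tokenizers handle words they have never seen whole.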




