Building a Tokenizer from Scratch [part 2]

Parser Theory: Q/A with Claude Opus

In part 1, we built a working FSM that recognizes <div>text</div> using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like <div><div>hello</div></div>. In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested <div> tags, and then turn it into a transducer that emits tokens. In other words, we are building the core of a lexer.

Q: Why can't FSMs handle nested structures?

Because an FSM has a fixed number of states, and that's all the memory it has. Consider nested divs:

<div><div><div>hello</div></div></div>

To correctly match closing tags, you need to count how many <div>s you've opened so you know how many </div>s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth, but someone can always write HTML nested one level deeper than your states can track.

Put simply: 1 level deep →
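The counting argument in the answer above can be made concrete with a small Python sketch. It is not the PDA this series builds, only an illustration under simplifying assumptions: the input contains only the literal tags <div> and </div> plus plain text, so the stack can be reduced to a single depth counter. The function names `max_depth_if_balanced` and `bounded_accepts` are hypothetical and chosen for this example. The second function caps the counter at a fixed limit to mimic a recognizer with a fixed number of states.

```python
OPEN, CLOSE = "<div>", "</div>"

def max_depth_if_balanced(html: str):
    """Return the deepest nesting level if every <div> is closed, else None."""
    depth = 0      # unbounded counter: the memory a fixed set of states lacks
    deepest = 0
    i = 0
    while i < len(html):
        if html.startswith(OPEN, i):
            depth += 1
            deepest = max(deepest, depth)
            i += len(OPEN)
        elif html.startswith(CLOSE, i):
            if depth == 0:
                return None   # closing tag with nothing open
            depth -= 1
            i += len(CLOSE)
        else:
            i += 1            # any other character is treated as plain text
    return deepest if depth == 0 else None

def bounded_accepts(html: str, limit: int) -> bool:
    """Mimic a fixed-state recognizer: accept only nesting up to `limit`."""
    deepest = max_depth_if_balanced(html)
    return deepest is not None and deepest <= limit

print(max_depth_if_balanced("<div><div><div>hello</div></div></div>"))  # 3
print(bounded_accepts("<div><div><div>hello</div></div></div>", 2))     # False
```

Whatever value is chosen for `limit`, an input nested one level deeper is rejected even though it is well formed, which is the limitation described above.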
