
How Text Becomes Binary: Character Encoding From ASCII to UTF-8
A few years ago I was building an API that accepted user input in multiple languages. Everything worked fine in English. Then a user submitted a form in Japanese and the database stored garbage characters -- the classic mojibake problem. I had assumed UTF-8 everywhere, but one middleware component was silently converting to Latin-1, which cannot represent Japanese characters. Fixing it took ten minutes. Finding it took two days.

Understanding how text becomes binary -- and how that binary becomes text again -- is fundamental to avoiding an entire category of bugs that are notoriously difficult to debug.

ASCII: Where It Started

ASCII (American Standard Code for Information Interchange) was published in 1963. It maps 128 characters to 7-bit binary numbers:

Character   Decimal   Binary
A           65        1000001
B           66        1000010
Z           90        1011010
a           97        1100001
0           48        0110000
space       32        0100000
newline     10        0001010

Some useful patterns to notice: uppercase letters start at 65, lowercase at 97. The difference is exactly 32, which is a single bit (the bit with value 32), so converting between uppercase and lowercase is just flipping that one bit.
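That bit-flip trick is easy to verify in Python, where `ord` and `chr` convert between a character and its code point (a minimal sketch; `flip_case` is just an illustrative helper, not a standard function):

```python
# Uppercase and lowercase ASCII letters differ only in the bit with value 32.
for upper, lower in [("A", "a"), ("B", "b"), ("Z", "z")]:
    assert ord(lower) - ord(upper) == 32

def flip_case(ch: str) -> str:
    """Toggle the case of an ASCII letter by XOR-ing the 32 bit."""
    return chr(ord(ch) ^ 0b100000)

print(flip_case("A"))              # a
print(flip_case("z"))              # Z
print(format(ord("A"), "07b"))     # 1000001, matching the table above
```

This is why old C code could lowercase a letter with `c | 0x20` and uppercase it with `c & ~0x20` -- no lookup table needed.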
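The mojibake bug from the opening anecdote is just as easy to reproduce: encode Japanese text as UTF-8, then decode the bytes as Latin-1 -- the same silent mis-decode the middleware was doing (a small reproduction; the sample string is arbitrary):

```python
text = "日本語"                      # "Japanese language" in Japanese
utf8_bytes = text.encode("utf-8")   # 9 bytes: 3 per character

# Latin-1 assigns a character to every byte value 0x00-0xFF, so this
# never raises an error -- it just silently produces garbage.
garbled = utf8_bytes.decode("latin-1")
print(garbled)                      # mojibake, not Japanese

# The damage is reversible here only because Latin-1 round-trips bytes:
assert garbled.encode("latin-1").decode("utf-8") == text
```

The dangerous part is that nothing fails loudly: every byte is "valid" Latin-1, so the corruption only becomes visible when a human reads the output.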




