
Characters, Bytes, and Code Points: Why String Length Is Never Simple
Pop quiz. What does "hello".length return in JavaScript? Five, obviously. Now what does "cafe\u0301".length return? If you said 5, you're right. The string looks like "cafe" with an accent on the e, rendering as "café". But its length is 5, not 4, because the accent is a separate combining character. And "caf\u00e9".length returns 4, even though it looks identical on screen. Two strings that look the same, render the same, and compare as equal in some contexts have different lengths. Welcome to Unicode. This is why building a character counter -- the kind you'd use for checking tweet length or meta description limits -- is surprisingly non-trivial once you step outside ASCII.

Characters vs. code points vs. grapheme clusters

The word "character" is ambiguous in computing. There are at least three things it can mean:

Code units are the individual values in a string's underlying encoding. In JavaScript (UTF-16), each code unit is 16 bits. String.length returns the number of UTF-16 code units.
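
To make the distinction concrete, here is a minimal sketch of the three counts side by side, assuming a modern JavaScript engine (Intl.Segmenter requires Node 16+ or a recent browser):

```js
// A minimal sketch of the three ways to count "characters" in JavaScript.
// Assumes a modern engine; Intl.Segmenter needs Node 16+ or a recent browser.

const precomposed = "caf\u00e9";  // 'é' as one code point (U+00E9)
const combining   = "cafe\u0301"; // 'e' followed by a combining acute accent (U+0301)

// .length counts UTF-16 code units
console.log(precomposed.length); // 4
console.log(combining.length);   // 5

// The two strings render identically but only compare equal after normalization
console.log(precomposed === combining);                  // false
console.log(precomposed === combining.normalize("NFC")); // true

// Spreading a string iterates code points, not code units
const thumbsUp = "\u{1F44D}";      // one code point, two code units (a surrogate pair)
console.log(thumbsUp.length);      // 2
console.log([...thumbsUp].length); // 1

// Intl.Segmenter counts grapheme clusters -- what users perceive as characters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const countGraphemes = (s) => [...segmenter.segment(s)].length;
console.log(countGraphemes(combining)); // 4
console.log(countGraphemes("\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}")); // 1 (family emoji: 7 code points, 1 cluster)
```

Which count matters depends on the limit you're enforcing: code units reflect UTF-16 storage, code points are what many APIs iterate over, and grapheme clusters are closest to what a user would call "a character".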




