
Before Unicode: Pure Chaos
Imagine you’re writing a letter in 1989. You type on a French computer, your colleague opens the file on a Japanese machine, and half the letters turn into random symbols. This happened all the time.
The root cause was simple: there was no agreement on which number represents which letter. ASCII covered 128 characters – enough for English, but it left the rest of the world out in the cold. Then came dozens of competing standards: Latin-1, Shift-JIS, KOI8-R, Big5… each one a different dialect of a language nobody could agree on.
The mojibake problem: “Mojibake” (文字化け) is Japanese for “character transformation”. It describes the garbled text you get when a file encoded in one system is decoded with another. If you’ve ever opened a file and seen “��” everywhere, you’ve met mojibake.
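To make the mismatch concrete, here is a minimal sketch that writes text as UTF-8 and then reads the same bytes back as Latin-1. It assumes Node.js, because it relies on Node's `Buffer` API. The sample word and the variable names are only illustrative.

```javascript
// Node.js-only sketch: Buffer is a Node API and is not available in browsers.
const original = "caf\u00E9"; // "café"

// Encode with one system (UTF-8): é becomes the two bytes C3 A9.
const bytes = Buffer.from(original, "utf8");

// Decode with another (Latin-1): each byte is read as its own character.
const garbled = bytes.toString("latin1");
console.log(garbled); // "cafÃ©"

// The bytes were never damaged, so reversing the wrong step restores the text.
const restored = Buffer.from(garbled, "latin1").toString("utf8");
console.log(restored === original); // true
```

The bytes on disk are identical in both cases. Only the lookup table used to interpret them changed.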
In 1987, engineers at Apple and Xerox started designing a single universal standard. The Unicode Consortium was incorporated in 1991, and Unicode 1.0 shipped with 7,161 characters. Today, Unicode 17.0 defines nearly 160,000 characters spanning 172 modern and historic scripts.
What Is a Code Point?
Here’s the core idea: Unicode is just a giant, agreed-upon lookup table. Every character in the universe gets a unique number. That number is called a code point.
| Property | Value |
|---|---|
| Character | 😀 GRINNING FACE |
| Code Point | U+1F600 |
| Decimal | 128512 |
| Block | Emoticons |
| Plane | Supplementary Multilingual Plane (Plane 1) |
| Category | So (Other Symbol) |
The U+ prefix is just notation — it means “Unicode code point.” The number that follows is in hexadecimal (base-16). So U+0041 is decimal 65, which is the letter A.
# The letter ‘A’
U+0041 ← hex 41 ← decimal 65
# Greek lowercase pi (π)
U+03C0 ← hex 3C0 ← decimal 960
# The grinning face emoji 😀
U+1F600 ← hex 1F600 ← decimal 128512
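If you want to check these conversions yourself, the sketch below uses JavaScript's built-in `codePointAt` and `String.fromCodePoint`. The helper names `toUPlus` and `describe` are made up for this example.

```javascript
// Format a number in the conventional U+XXXX notation (at least 4 hex digits).
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Describe the first code point of a string.
function describe(char) {
  const codePoint = char.codePointAt(0);
  return { decimal: codePoint, notation: toUPlus(codePoint) };
}

console.log(describe("A"));         // { decimal: 65, notation: 'U+0041' }
console.log(describe("\u03C0"));    // { decimal: 960, notation: 'U+03C0' }
console.log(describe("\u{1F600}")); // { decimal: 128512, notation: 'U+1F600' }

// Going the other way: from a number back to a character.
console.log(String.fromCodePoint(0x03c0)); // "π"
```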
Unicode can theoretically address up to 1,114,112 code points (written U+0000 to U+10FFFF). Think of it as a post office that pre-assigned every possible address in a city, even ones that don’t have buildings yet.
Code point ≠ byte: This is the most important thing to keep in your head as you read on. A code point is just an abstract number, an ID. How that number actually gets stored on disk or sent over a network is a separate question, answered by encodings like UTF-8 (covered in Section 5).
Blocks & Scripts
Code points aren’t scattered randomly. Unicode organizes them into blocks: contiguous ranges of code points that belong to a related group. Think of a block as a zip code — it tells you roughly which neighborhood a character lives in.
| Block Name | Range | Sample | Size |
|---|---|---|---|
| Basic Latin | U+0000–U+007F | A B C 1 2 ! @ | 128 |
| Latin-1 Supplement | U+0080–U+00FF | Æ Ñ ü ç | 128 |
| Greek and Coptic | U+0370–U+03FF | α β π Ω | 144 |
| Arabic | U+0600–U+06FF | ا ب ت ل | 256 |
| CJK Unified Ideographs | U+4E00–U+9FFF | 中 文 世 界 | 20,992 |
| Emoticons | U+1F600–U+1F64F | 😀 😄 😭 😍 | 80 |
| Musical Symbols | U+1D100–U+1D1FF | 𝄞 𝄢 𝄪 | 256 |
A script is a slightly different concept — it’s the writing system itself (Latin, Arabic, Hangul, Devanagari). One script can span multiple blocks, and one block can technically contain characters from multiple scripts, though Unicode tries hard to keep things tidy.
The CJK Unified Ideographs block is enormous: Chinese, Japanese, and Korean share many characters that look identical or nearly identical. Rather than encoding them separately, Unicode “unified” them into one block. This is sometimes controversial (purists argue they’re distinct), but it saved tens of thousands of code points.
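The difference between a block and a script is easier to see side by side. The sketch below is only an illustration. `BLOCKS` is a hand-copied subset of the table in this section, `blockOf` is a hypothetical helper, and the script check uses the standard `\p{Script=...}` regular-expression property escapes, which need the `u` flag.

```javascript
// A few rows copied from the block table; not a complete list of blocks.
const BLOCKS = [
  { name: "Basic Latin", start: 0x0000, end: 0x007f },
  { name: "Latin-1 Supplement", start: 0x0080, end: 0x00ff },
  { name: "Greek and Coptic", start: 0x0370, end: 0x03ff },
  { name: "Emoticons", start: 0x1f600, end: 0x1f64f },
];

// Look up the block by range; returns null for ranges not listed above.
function blockOf(char) {
  const cp = char.codePointAt(0);
  const block = BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : null;
}

// Script membership comes from Unicode data built into the regex engine.
const isLatin = (char) => /^\p{Script=Latin}$/u.test(char);
const isGreek = (char) => /^\p{Script=Greek}$/u.test(char);

// Same script, two different blocks.
console.log(blockOf("A"), isLatin("A"));           // Basic Latin true
console.log(blockOf("\u00C6"), isLatin("\u00C6")); // Latin-1 Supplement true
console.log(blockOf("\u03C0"), isGreek("\u03C0")); // Greek and Coptic true
```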
The 17 Planes
Unicode’s 1,114,112 code points are organized into 17 planes, each containing 65,536 (= 2^16) code points. Think of a plane as a floor in a very tall building, and each floor is a 256×256 grid of apartments.
“The BMP is the ground floor of the Unicode building. Most of humanity lives here. The other 16 floors exist for historians, mathematicians, and emoji enthusiasts.”
The Basic Multilingual Plane (BMP) spans U+0000 to U+FFFF. It holds virtually every character used in modern writing: all Latin-script languages, Arabic, Hebrew, Devanagari, CJK basics, currency symbols, punctuation, and more.
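Anything above U+FFFF is called a supplementary character. Most emoji live in Plane 1. Ancient scripts like Linear B and Egyptian hieroglyphs also live there, along with the Mathematical Alphanumeric Symbols block (the bold, italic, and script letters used in math). Everyday math operators such as ∑ and ≠ stay in the BMP.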
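Because every plane holds exactly 0x10000 code points, finding a character's floor is a single shift. A minimal sketch, with `planeOf` and `isSupplementary` as illustrative names:

```javascript
// The plane number is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  return codePoint >> 16;
}

// Anything outside Plane 0 (the BMP) is a supplementary character.
const isSupplementary = (codePoint) => planeOf(codePoint) > 0;

console.log(planeOf(0x0041));   // 0  (A, in the BMP)
console.log(planeOf(0x1f600));  // 1  (😀, Supplementary Multilingual Plane)
console.log(planeOf(0x10ffff)); // 16 (the last code point)
```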
Encodings: UTF-8, UTF-16, UTF-32
A code point is an abstract idea. To actually store or transmit text, you need an encoding — a recipe that converts code point numbers into bytes. Unicode has three main encodings.
UTF-32: the simple one
UTF-32 uses exactly 4 bytes per code point, always. Simple math, no tricks. The downside? English text is 4× larger than it needs to be, since every ASCII letter wastes 3 bytes of zeroes.
UTF-8: the clever one
UTF-8 is the encoding of the web. It uses 1 to 4 bytes per code point, depending on the code point’s value. Crucially, the 128 ASCII characters take exactly 1 byte each — so a pure-ASCII file in UTF-8 is identical to an ASCII file.
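| Code point range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000–U+007F | 0xxxxxxx | | | |
| U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |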
The leading bits of each byte act as a “road sign” telling a decoder how many bytes to read. A byte starting with 0 is a single-byte character. A byte starting with 110, 1110, or 11110 is the start of a 2-, 3-, or 4-byte sequence. Bytes starting with 10 are continuation bytes, never the start of a character.
# ‘A’ (U+0041) → 1 byte in UTF-8
01000001
└─ starts with 0: single byte
# ‘©’ (U+00A9) → 2 bytes in UTF-8
11000010 10101001
└─ first byte starts with 110: 2-byte sequence
└─ second byte starts with 10: continuation byte
# ‘😀’ (U+1F600) → 4 bytes in UTF-8
11110000 10011111 10011000 10000000
└─ starts with 11110: 4-byte sequence
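The bit patterns above can be reproduced with a few shifts and masks. The following is a teaching sketch, not a replacement for the built-in `TextEncoder`. `encodeUtf8` and `toBinary` are names invented for this example.

```javascript
// Encode one code point into an array of UTF-8 byte values.
function encodeUtf8(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError("cannot encode: " + cp);
  }
  if (cp <= 0x7f) return [cp]; // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18), // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Show bytes as 8-bit binary strings so the leading bits are visible.
const toBinary = (bytes) =>
  bytes.map((b) => b.toString(2).padStart(8, "0")).join(" ");

console.log(toBinary(encodeUtf8(0x0041)));  // 01000001
console.log(toBinary(encodeUtf8(0x00a9)));  // 11000010 10101001
console.log(toBinary(encodeUtf8(0x1f600))); // 11110000 10011111 10011000 10000000

// Cross-check against the built-in encoder.
console.log(Array.from(new TextEncoder().encode("\u{1F600}"))); // [ 240, 159, 152, 128 ]
```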
UTF-16: the legacy one
UTF-16 uses 2 bytes for characters in the BMP and 4 bytes for supplementary characters. It’s used internally by JavaScript, Java, and Windows. Its complexity — especially the surrogate pair mechanism — is the price paid for keeping BMP characters at 2 bytes.
| Encoding | Bytes per char | ASCII-efficient? | Used by |
|---|---|---|---|
| UTF-8 | 1–4 | ✅ Yes (1 byte) | Web, Linux, macOS files |
| UTF-16 | 2 or 4 | ❌ No (2 bytes) | JavaScript, Java, Windows APIs |
| UTF-32 | 4, always | ❌ No (4 bytes) | Internal processing, some databases |
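To see the trade-offs in the table with real numbers, the sketch below counts how many bytes a string would take in each encoding, ignoring any byte order mark. It assumes well-formed input (no lone surrogates), and `encodedSizes` is an illustrative helper name.

```javascript
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // actual UTF-8 bytes
    utf16: text.length * 2,      // .length counts 16-bit code units
    utf32: [...text].length * 4, // string iteration yields code points
  };
}

console.log(encodedSizes("hello"));        // { utf8: 5, utf16: 10, utf32: 20 }
console.log(encodedSizes("\u4E2D\u6587")); // 中文: { utf8: 6, utf16: 4, utf32: 8 }
console.log(encodedSizes("\u{1F600}"));    // 😀: { utf8: 4, utf16: 4, utf32: 4 }
```

Notice that UTF-8 is not always the smallest: for CJK text, each BMP character costs 3 bytes in UTF-8 but only 2 in UTF-16.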
Surrogates: UTF-16’s Clever Trick
Here’s a puzzle: UTF-16 uses 2 bytes (16 bits) per character. That gives you 65,536 possible values. But Unicode has over a million code points. How do you squeeze a million into 65,536 slots?
The answer is surrogate pairs. Unicode deliberately reserved two blocks in the BMP for this purpose:
| Block | Range | Role |
|---|---|---|
| High Surrogates | U+D800–U+DBFF | First half of a surrogate pair |
| Low Surrogates | U+DC00–U+DFFF | Second half of a surrogate pair |
When UTF-16 needs to encode a supplementary character (above U+FFFF), it breaks it into two 16-bit values: a high surrogate followed by a low surrogate. Together, the pair encodes the original code point. Neither half on its own is a valid character — they only work as a team.
# Encode U+1F600 (😀) in UTF-16
Step 1: Subtract 0x10000 from the code point
0x1F600 - 0x10000 = 0xF600
Step 2: Express as 20-bit binary
0xF600 = 0000 1111 0110 0000 0000
Step 3: Split into two 10-bit halves
High 10 bits: 00 0011 1101 = 0x03D
Low 10 bits: 10 0000 0000 = 0x200
Step 4: Add surrogate bases
High surrogate: 0xD800 + 0x03D = 0xD83D
Low surrogate: 0xDC00 + 0x200 = 0xDE00
# Result: 😀 = D83D DE00 in UTF-16
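The four steps above translate almost line for line into code. This is a sketch with hypothetical helper names (`toSurrogatePair`, `fromSurrogatePair`). In real code you would normally let the language handle this for you.

```javascript
// Split a supplementary code point into [high, low] surrogates.
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("not a supplementary code point: " + cp);
  }
  const offset = cp - 0x10000;            // Step 1: a 20-bit value
  const high = 0xd800 + (offset >> 10);   // Steps 3-4: top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // Steps 3-4: bottom 10 bits
  return [high, low];
}

// Reverse the process: combine a pair back into one code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));        // d83d de00
console.log(fromSurrogatePair(high, low).toString(16));  // 1f600

// The two code units really are how JavaScript stores the emoji.
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
```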
Why do surrogates matter to you? JavaScript strings are UTF-16 under the hood, so length counts 16-bit code units, not characters. The string "😀" is stored as a surrogate pair and therefore has a length of 2, not 1. This surprises developers constantly.
"😀".length // → 2 (surprise!)
[..."😀"].length // → 1 (spread uses code points)
"😀".codePointAt(0) // → 128512 (correct code point)
"😀".charCodeAt(0) // → 55357 (high surrogate only!)
Code Points ≠ Characters (Grapheme Clusters)
You might think: one code point = one visible character. Close, but not always true. Some of what we see as a single character is actually composed of multiple code points working together.
| Property | Value |
|---|---|
| Visible | 1 character |
| Code Points | 3 code points |
| UTF-8 bytes | 11 bytes |
| Breakdown | 👩 + ZWJ + 💻 |
The “woman technologist” emoji is actually three code points joined by an invisible glue character called a Zero Width Joiner (ZWJ, U+200D). The rendering engine sees the ZWJ and knows to merge the adjacent emoji into one image.
These visual units are called grapheme clusters. A grapheme cluster can contain:
- A base character + combining diacritics (e.g., e + U+0301 COMBINING ACUTE ACCENT renders as é, two code points)
- An emoji + skin tone modifier (e.g., 👍🏽 = two code points)
- A sequence of emoji joined by ZWJ (family emoji can be 7+ code points)
- A flag emoji (two Regional Indicator letters, e.g., 🇺🇸 = US flag)
String length is meaningless without context: If you need to count “characters” that a user sees, you need a grapheme cluster algorithm (like Intl.Segmenter in modern JavaScript). Counting bytes, code units, or even raw code points will give you the wrong answer for complex emoji.
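Here is a small sketch of the four ways to “measure” the woman technologist emoji. It assumes a runtime that ships `Intl.Segmenter` (recent browsers and recent Node.js releases). `countGraphemes` is an illustrative helper, and the emoji is written with escapes so the invisible ZWJ cannot get lost in copy and paste.

```javascript
// Count user-perceived characters using the built-in grapheme segmenter.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// WOMAN (U+1F469) + ZERO WIDTH JOINER (U+200D) + LAPTOP (U+1F4BB)
const technologist = "\u{1F469}\u200D\u{1F4BB}";

console.log(new TextEncoder().encode(technologist).length); // 11 UTF-8 bytes
console.log(technologist.length);                           // 5 UTF-16 code units
console.log([...technologist].length);                      // 3 code points
console.log(countGraphemes(technologist));                  // 1 grapheme cluster

// The same idea applies to combining marks: e + U+0301 is one visible character.
console.log(countGraphemes("e\u0301")); // 1
```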
TL;DR Cheat Sheet
| Term | Plain English | Example |
|---|---|---|
| Code Point | A unique number assigned to every character | U+1F600 |
| Block | A named range of related code points | Emoticons block |
| Plane | One of 17 groups of 65,536 code points each | BMP = Plane 0 |
| BMP | The first plane; holds almost all modern text | U+0000 – U+FFFF |
| Encoding | Rules for turning code points into bytes | UTF-8, UTF-16 |
| UTF-8 | Variable-width encoding; 1–4 bytes; default for web | 😀 = F0 9F 98 80 |
| Surrogate Pair | Two UTF-16 values that together encode one supplementary code point | D83D + DE00 = 😀 |
| Grapheme Cluster | One visible character as perceived by a human | 👩‍💻 = 3 code points |
| ZWJ | Zero Width Joiner – invisible glue between emoji | U+200D |
The short summary: Unicode gives every character a unique number (code point). Encodings like UTF-8 decide how to store those numbers as bytes. The BMP holds most everyday text; emoji and historic scripts live in supplementary planes. UTF-16 uses surrogate pairs to reach those higher planes. And what you see as one character might be several code points fused together.

