How Unicode Actually Works - CharacterCodes Blog

Section 01

Before Unicode: Pure Chaos

Imagine you’re writing a letter in 1989. You type on a French computer, your colleague opens the file on a Japanese machine, and half the letters turn into random symbols. This happened all the time.

The root cause was simple: there was no agreement on which number represents which letter. ASCII covered 128 characters – enough for English, but it left the rest of the world out in the cold. Then came dozens of competing standards: Latin-1, Shift-JIS, KOI8-R, Big5… each one a different dialect of a language nobody could agree on.

⚠️

The Mojibake problem: “Mojibake” (文字化け) is Japanese for “character transformation”. It describes the garbled text you get when a file encoded in one system is decoded with another. If you’ve ever opened a file and seen “ï¿½ï¿½” everywhere, you’ve met mojibake.

In 1987, engineers at Apple and Xerox started designing a single universal standard. The Unicode Consortium was incorporated in 1991, and Unicode 1.0 shipped with 7,161 characters. Today, Unicode 17.0 defines nearly 160,000 characters spanning 172 modern and historic scripts.

Section 02

What Is a Code Point?

Here’s the core idea: Unicode is just a giant, agreed-upon lookup table. Every character it encodes gets a unique number. That number is called a code point.

😀

Character: GRINNING FACE
Code Point: U+1F600
Decimal: 128512
Block: Emoticons
Plane: Supplementary Multilingual (1)
Category: So (Other Symbol)

The U+ prefix is just notation — it means “Unicode code point.” The number that follows is in hexadecimal (base-16). So U+0041 is decimal 65, which is the letter A.

notation
# The letter ‘A’
U+0041 ← hex 41 ← decimal 65

# Greek lowercase pi (π)
U+03C0 ← hex 3C0 ← decimal 960

# The grinning face emoji 😀
U+1F600 ← hex 1F600 ← decimal 128512
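If you want to play with the notation yourself, here is a minimal JavaScript sketch. The helper names `toNotation` and `fromNotation` are made up for this illustration; only `codePointAt` and `String.fromCodePoint` are built into the language.

```javascript
// Hypothetical helpers for illustration. The built-ins doing the real work
// are String.prototype.codePointAt and String.fromCodePoint.

// Format the first code point of a string as "U+XXXX" (at least 4 hex digits).
function toNotation(char) {
  const codePoint = char.codePointAt(0);
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Parse "U+XXXX" back into the character it names.
function fromNotation(notation) {
  const codePoint = parseInt(notation.slice(2), 16);
  return String.fromCodePoint(codePoint);
}

console.log(toNotation("A"));          // U+0041
console.log(toNotation("\u03C0"));     // U+03C0 (Greek lowercase pi)
console.log(fromNotation("U+1F600"));  // the grinning face emoji
```

The hex-to-decimal step is just `parseInt("1F600", 16)`, which gives 128512.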

Unicode can theoretically address up to 1,114,112 code points (written U+0000 to U+10FFFF). Think of it as a post office that pre-assigned every possible address in a city, even ones that don’t have buildings yet.

💡

Code point ≠ byte: This is the most important thing to keep in your head as you read on. A code point is just an abstract number, an ID. How that number actually gets stored on disk or sent over a network is a separate question, answered by encodings like UTF-8 (covered in Section 05).

Section 03

Blocks & Scripts

Code points aren’t scattered randomly. Unicode organizes them into blocks: contiguous ranges of code points that belong to a related group. Think of a block as a zip code — it tells you roughly which neighborhood a character lives in.

Block Name	Range	Sample	Size
Basic Latin	U+0000–U+007F	A B C 1 2 ! @	128
Latin-1 Supplement	U+0080–U+00FF	Æ Ñ ü ç	128
Greek and Coptic	U+0370–U+03FF	α β π Ω	144
Arabic	U+0600–U+06FF	ا ب ت ل	256
CJK Unified Ideographs	U+4E00–U+9FFF	中文世界	20,992
Emoticons	U+1F600–U+1F64F	😀 😄 😭 😍	80
Musical Symbols	U+1D100–U+1D1FF	𝄞 𝄢 𝄪	256
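Because a block is nothing more than a numeric range, looking one up is a simple range check. The sketch below hard-codes only the rows from the table above; `SAMPLE_BLOCKS`, `blockOf`, and `blockSize` are names invented for this example, and the real standard defines several hundred blocks.

```javascript
// A tiny excerpt of the block table above, not the full Unicode block list.
const SAMPLE_BLOCKS = [
  { name: "Basic Latin", start: 0x0000, end: 0x007f },
  { name: "Latin-1 Supplement", start: 0x0080, end: 0x00ff },
  { name: "Greek and Coptic", start: 0x0370, end: 0x03ff },
  { name: "Arabic", start: 0x0600, end: 0x06ff },
  { name: "CJK Unified Ideographs", start: 0x4e00, end: 0x9fff },
  { name: "Musical Symbols", start: 0x1d100, end: 0x1d1ff },
  { name: "Emoticons", start: 0x1f600, end: 0x1f64f },
];

// Hypothetical helper: which sample block does the first code point fall in?
function blockOf(char) {
  const cp = char.codePointAt(0);
  const block = SAMPLE_BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : "(not in sample table)";
}

// A block's size is just the length of its range.
function blockSize(block) {
  return block.end - block.start + 1;
}

console.log(blockOf("A"));                // Basic Latin
console.log(blockOf("\u03C0"));           // Greek and Coptic
console.log(blockOf("\u{1F600}"));        // Emoticons
console.log(blockSize(SAMPLE_BLOCKS[6])); // 80
```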

A script is a slightly different concept — it’s the writing system itself (Latin, Arabic, Hangul, Devanagari). One script can span multiple blocks, and one block can technically contain characters from multiple scripts, though Unicode tries hard to keep things tidy.
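JavaScript regular expressions can test the script property directly with `\p{Script=...}`, as long as the `u` flag is set. Here is a small sketch of the block-versus-script difference; the variable names are just for this example.

```javascript
// Unicode property escapes need the "u" flag.
const greekScript = /\p{Script=Greek}/u;
const latinScript = /\p{Script=Latin}/u;
const commonScript = /\p{Script=Common}/u;

// One script, several blocks: "A" is in Basic Latin, "ü" is in
// Latin-1 Supplement, yet both belong to the Latin script.
console.log(latinScript.test("A"));       // true
console.log(latinScript.test("\u00FC"));  // true

// One block, several scripts: the digit "1" sits in the Basic Latin block,
// but its script is Common (shared by many writing systems), not Latin.
console.log(latinScript.test("1"));       // false
console.log(commonScript.test("1"));      // true

console.log(greekScript.test("\u03C0"));  // true
```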

📖

The CJK Unified Ideographs block is enormous: Chinese, Japanese, and Korean share many characters that look identical or nearly identical. Rather than encoding them separately, Unicode “unified” them into one block. This is sometimes controversial, since purists argue they’re distinct, but it saved tens of thousands of code points.

Section 04

The 17 Planes

Unicode’s 1,114,112 code points are organized into 17 planes, each containing 65,536 (= 2¹⁶) code points. Think of a plane as a floor in a very tall building, and each floor is a 256×256 grid of apartments.

PLANE 0

BMP

Basic Multilingual Plane. The vast majority of text you’ll ever encounter.

PLANE 1

SMP

Supplementary Multilingual. Historic scripts, math symbols, emoji.

PLANE 2

SIP

Supplementary Ideographic. Rare & historic CJK ideographs.

PLANE 3

TIP

Tertiary Ideographic. Very rare CJK characters.

PLANES 4–13

Unassigned

Reserved for future use.

PLANE 14

SSP

Supplementary Special-purpose. Tag and selector characters.

PLANES 15–16

Private Use

Reserved for private agreements. Not standardized.

“The BMP is the ground floor of the Unicode building. Most of humanity lives here. The other 16 floors exist for historians, mathematicians, and emoji enthusiasts.”

The Basic Multilingual Plane (BMP) spans U+0000 to U+FFFF. It holds virtually every character used in modern writing: all Latin-script languages, Arabic, Hebrew, Devanagari, CJK basics, currency symbols, punctuation, and more.

Anything above U+FFFF is called a supplementary character. Most emoji live in Plane 1. Ancient scripts like Linear B and Egyptian hieroglyphs also live there, along with the Mathematical Alphanumeric Symbols block (the everyday math operators such as ∑ and ≤ are in the BMP).
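Since every plane holds exactly 65,536 (2¹⁶) code points, the plane number is just the code point shifted right by 16 bits. A quick sketch, with `planeOf` and `isSupplementary` as made-up helper names:

```javascript
// Plane number = code point divided by 65,536, rounded down.
function planeOf(char) {
  return char.codePointAt(0) >> 16;
}

// Supplementary characters are everything outside the BMP (Plane 0).
function isSupplementary(char) {
  return char.codePointAt(0) > 0xffff;
}

console.log(planeOf("A"));                // 0 (BMP)
console.log(planeOf("\u{1F600}"));        // 1 (SMP, grinning face)
console.log(planeOf("\u{20000}"));        // 2 (SIP, a rare CJK ideograph)
console.log(planeOf("\u{10FFFF}"));       // 16 (the last plane)
console.log(isSupplementary("\u{1F600}")); // true
```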

Section 05

Encodings: UTF-8, UTF-16, UTF-32

A code point is an abstract idea. To actually store or transmit text, you need an encoding — a recipe that converts code point numbers into bytes. Unicode has three main encodings.

UTF-32: the simple one

UTF-32 uses exactly 4 bytes per code point, always. Simple math, no tricks. The downside? English text is 4× larger than it needs to be, since every ASCII letter wastes 3 bytes of zeroes.
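To see those wasted zero bytes, here is a rough sketch of a UTF-32 encoder. It assumes big-endian byte order (the text above doesn’t pick one), and `encodeUtf32BE` and `toHex32` are hypothetical names for this example.

```javascript
// Encode a string as UTF-32 big-endian: one 4-byte unit per code point.
// Assumption: big-endian order, no byte order mark.
function encodeUtf32BE(text) {
  // Array.from walks the string by code point, not by UTF-16 code unit.
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  const bytes = new Uint8Array(codePoints.length * 4);
  const view = new DataView(bytes.buffer);
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, false)); // false = big-endian
  return bytes;
}

// Pretty-print bytes as two-digit hex.
function toHex32(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex32(encodeUtf32BE("A")));         // 00 00 00 41
console.log(toHex32(encodeUtf32BE("\u{1F600}"))); // 00 01 f6 00
```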

UTF-8: the clever one

UTF-8 is the encoding of the web. It uses 1 to 4 bytes per code point, depending on the code point’s value. Crucially, the 128 ASCII characters take exactly 1 byte each — so a pure-ASCII file in UTF-8 is identical to an ASCII file.

1 byte

0xxxxxxx

2 bytes

110xxxxx
10xxxxxx

3 bytes

1110xxxx
10xxxxxx
10xxxxxx

4 bytes

11110xxx
10xxxxxx
10xxxxxx
10xxxxxx

The leading bits of each byte act as a “road sign” telling a decoder: how many bytes to read? A byte starting with 0 is a single-byte character. A byte starting with 110 is the start of a 2-byte sequence. Bytes starting with 10 are continuation bytes — never the start of a character.
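Here is how a decoder might read those road signs, as a simplified sketch. `utf8ByteKind` is a made-up name, and the function only looks at the leading bits; a real decoder also rejects overlong and out-of-range sequences.

```javascript
// Classify one byte (0-255) by its leading bits.
// Simplification: this checks bit patterns only, not full UTF-8 validity.
function utf8ByteKind(byte) {
  if ((byte & 0b10000000) === 0b00000000) return { kind: "single", length: 1 };
  if ((byte & 0b11000000) === 0b10000000) return { kind: "continuation", length: 0 };
  if ((byte & 0b11100000) === 0b11000000) return { kind: "lead", length: 2 };
  if ((byte & 0b11110000) === 0b11100000) return { kind: "lead", length: 3 };
  if ((byte & 0b11111000) === 0b11110000) return { kind: "lead", length: 4 };
  return { kind: "invalid", length: 0 };
}

console.log(utf8ByteKind(0x41)); // { kind: 'single', length: 1 }
console.log(utf8ByteKind(0xc2)); // { kind: 'lead', length: 2 }
console.log(utf8ByteKind(0xa9)); // { kind: 'continuation', length: 0 }
console.log(utf8ByteKind(0xf0)); // { kind: 'lead', length: 4 }
```

Because continuation bytes are always recognizable, a decoder dropped into the middle of a stream can skip forward to the next non-continuation byte and resynchronize.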

example
# ‘A’ (U+0041) → 1 byte in UTF-8
01000001
└─ starts with 0: single byte

# ‘©’ (U+00A9) → 2 bytes in UTF-8
11000010 10101001
└─ starts with 110: 2-byte sequence
└─ starts with 10: continuation byte

# ‘😀’ (U+1F600) → 4 bytes in UTF-8
11110000 10011111 10011000 10000000
└─ starts with 11110: 4-byte sequence
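The same bit-packing can be written out by hand. This is a teaching sketch, not a production encoder: `encodeCodePointUtf8` and `toBinary8` are hypothetical names, and the function doesn’t reject surrogates or values above U+10FFFF. In real code you would use the built-in `TextEncoder`.

```javascript
// Encode a single code point into its UTF-8 bytes.
// Assumption: cp is a valid code point (no range or surrogate checks).
function encodeCodePointUtf8(cp) {
  if (cp <= 0x7f) return [cp];                               // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];           // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [
      0xe0 | (cp >> 12),                                     // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f),                             // 10xxxxxx
      0x80 | (cp & 0x3f),                                    // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18),                                       // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),                              // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f),                               // 10xxxxxx
    0x80 | (cp & 0x3f),                                      // 10xxxxxx
  ];
}

// Show bytes as 8-digit binary, like the example above.
function toBinary8(bytes) {
  return bytes.map((b) => b.toString(2).padStart(8, "0")).join(" ");
}

console.log(toBinary8(encodeCodePointUtf8(0x41)));    // 01000001
console.log(toBinary8(encodeCodePointUtf8(0xa9)));    // 11000010 10101001
console.log(toBinary8(encodeCodePointUtf8(0x1f600))); // 11110000 10011111 10011000 10000000
```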

UTF-16: the legacy one

UTF-16 uses 2 bytes for characters in the BMP and 4 bytes for supplementary characters. It’s used internally by JavaScript, Java, and Windows. Its complexity — especially the surrogate pair mechanism — is the price paid for keeping BMP characters at 2 bytes.

Encoding	Bytes per char	ASCII-efficient?	Used by
UTF-8	1–4	✅ Yes (1 byte)	Web, Linux, macOS files
UTF-16	2 or 4	❌ No (2 bytes)	JavaScript, Java, Windows APIs
UTF-32	4, always	❌ No (4 bytes)	Internal processing, some databases
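You can check the table’s numbers for any string with a few lines of JavaScript. This sketch uses the built-in `TextEncoder` for UTF-8 and leans on the fact that `.length` counts UTF-16 code units; `encodedSizes` is an invented name, and the sizes ignore any byte order mark.

```javascript
// Byte counts for the same text in each encoding (no byte order mark).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2,             // .length counts 16-bit code units
    utf32: Array.from(text).length * 4, // Array.from iterates by code point
  };
}

console.log(encodedSizes("hello"));      // { utf8: 5, utf16: 10, utf32: 20 }
console.log(encodedSizes("\u4E2D\u6587")); // { utf8: 6, utf16: 4, utf32: 8 }
console.log(encodedSizes("\u{1F600}"));  // { utf8: 4, utf16: 4, utf32: 4 }
```

Notice that UTF-8 wins for English, UTF-16 wins for the two CJK characters, and all three tie on the emoji.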

Section 06

Surrogates: UTF-16’s Clever Trick

Here’s a puzzle: UTF-16 is built from 2-byte (16-bit) units. One unit gives you 65,536 possible values. But Unicode has over a million code points. How do you squeeze a million into 65,536 slots?

The answer is surrogate pairs. Unicode deliberately reserved two blocks in the BMP for this purpose:

Block	Range	Role
High Surrogates	U+D800–U+DBFF	First half of a surrogate pair
Low Surrogates	U+DC00–U+DFFF	Second half of a surrogate pair

When UTF-16 needs to encode a supplementary character (above U+FFFF), it breaks it into two 16-bit values: a high surrogate followed by a low surrogate. Together, the pair encodes the original code point. Neither half on its own is a valid character — they only work as a team.

surrogate pair math
# Encode U+1F600 (😀) in UTF-16

Step 1: Subtract 0x10000 from the code point
0x1F600 - 0x10000 = 0xF600

Step 2: Express as 20-bit binary
0xF600 = 0000 1111 0110 0000 0000

Step 3: Split into two 10-bit halves
High 10 bits: 00 0011 1101 = 0x03D
Low 10 bits: 10 0000 0000 = 0x200

Step 4: Add surrogate bases
High surrogate: 0xD800 + 0x03D = 0xD83D
Low surrogate: 0xDC00 + 0x200 = 0xDE00

# Result: 😀 = D83D DE00 in UTF-16
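The four steps above translate almost line for line into code. A minimal sketch, with `toSurrogatePair` and `fromSurrogatePair` as made-up helper names:

```javascript
// Split a supplementary code point into a [high, low] surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = cp - 0x10000;            // Step 1: a 20-bit value
  const high = 0xd800 + (offset >> 10);   // Steps 3-4: top 10 bits + high base
  const low = 0xdc00 + (offset & 0x3ff);  // Steps 3-4: bottom 10 bits + low base
  return [high, low];
}

// Reverse the process: rebuild the code point from the two halves.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const pair = toSurrogatePair(0x1f600);
console.log(pair.map((unit) => unit.toString(16).toUpperCase())); // [ 'D83D', 'DE00' ]
console.log(fromSurrogatePair(0xd83d, 0xde00).toString(16));      // 1f600
```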

⚡

Why do surrogates matter to you? JavaScript strings are UTF-16 under the hood. This means emoji.length often returns 2 for a single emoji — because it’s a surrogate pair. The string "😀" has a length of 2 in JavaScript, not 1. This surprises developers constantly.

javascript gotcha
"😀".length;          // → 2  (surprise!)
[..."😀"].length;     // → 1  (spread iterates by code point)
"😀".codePointAt(0);  // → 128512  (the full code point)
"😀".charCodeAt(0);   // → 55357   (high surrogate only!)

Section 07

Code Points ≠ Characters (Grapheme Clusters)

You might think: one code point = one visible character. Close, but not always true. Some of what we see as a single character is actually composed of multiple code points working together.

👩‍💻

Visible: 1 character
Code Points: 3 code points
UTF-8 bytes: 11 bytes
Breakdown: 👩 + ZWJ + 💻

The “woman technologist” emoji is actually three code points joined by an invisible glue character called a Zero Width Joiner (ZWJ, U+200D). The rendering engine sees the ZWJ and knows to merge the adjacent emoji into one image.
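You can take the sequence apart yourself. The sketch below writes the emoji with escape sequences so the invisible ZWJ is easy to spot; the variable names are just for this example.

```javascript
// Woman + Zero Width Joiner + laptop, written with escapes.
const technologist = "\u{1F469}\u200D\u{1F4BB}";

// Array.from walks the string one code point at a time.
const zwjCodePoints = Array.from(
  technologist,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(zwjCodePoints);                                 // [ 'U+1F469', 'U+200D', 'U+1F4BB' ]
console.log(technologist.length);                           // 5 UTF-16 code units (2 + 1 + 2)
console.log(new TextEncoder().encode(technologist).length); // 11 UTF-8 bytes (4 + 3 + 4)
```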

These visual units are called grapheme clusters. A grapheme cluster can contain:

A base character + combining diacritics (e.g., é = e + U+0301 COMBINING ACUTE ACCENT, two code points)
An emoji + skin tone modifier (e.g., 👍🏽 = two code points)
A sequence of emoji joined by ZWJ (family emoji can be 7+ code points)
A flag emoji (two Regional Indicator letters, e.g., 🇺🇸 = US flag)
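Here is a quick sketch that counts the code points in one example of each kind from the list above. `countCodePoints` and `clusterSamples` are names invented for this illustration, and the strings use escapes so every hidden code point is visible.

```javascript
// Count code points (not UTF-16 code units) in a string.
function countCodePoints(text) {
  return Array.from(text).length;
}

const clusterSamples = {
  eWithAcute: "e\u0301",                      // e + combining acute accent
  thumbsUpMedium: "\u{1F44D}\u{1F3FD}",       // thumbs up + skin tone modifier
  family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}", // 4 people + 3 ZWJ
  usFlag: "\u{1F1FA}\u{1F1F8}",               // regional indicators U + S
};

for (const [name, text] of Object.entries(clusterSamples)) {
  console.log(name, countCodePoints(text));
}
// eWithAcute 2
// thumbsUpMedium 2
// family 7
// usFlag 2
```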

⚠️

String length is meaningless without context: If you need to count “characters” that a user sees, you need a grapheme cluster algorithm (like Intl.Segmenter in modern JavaScript). Counting bytes, code units, or even raw code points will give you the wrong answer for complex emoji.
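A minimal sketch of counting user-visible characters with `Intl.Segmenter`. The helper name `countGraphemes` is made up, and the exact results depend on the Unicode data that ships with your browser or Node.js version, so treat this as an illustration rather than a guarantee.

```javascript
// Count grapheme clusters: what a user would call "characters".
// Assumption: the runtime supports Intl.Segmenter with up-to-date Unicode data.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

const womanTechnologist = "\u{1F469}\u200D\u{1F4BB}";

console.log(womanTechnologist.length);             // 5 (UTF-16 code units)
console.log(Array.from(womanTechnologist).length); // 3 (code points)
console.log(countGraphemes(womanTechnologist));    // 1 on runtimes with current emoji data
```

If you’re truncating text for display or enforcing a “max characters” limit, this is the count to use.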

Section 08

TL;DR Cheat Sheet

Term	Plain English	Example
`Code Point`	A unique number assigned to every character	`U+1F600`
`Block`	A named range of related code points	Emoticons block
`Plane`	One of 17 groups of 65,536 code points each	BMP = Plane 0
`BMP`	The first plane; holds almost all modern text	U+0000 – U+FFFF
`Encoding`	Rules for turning code points into bytes	UTF-8, UTF-16
`UTF-8`	Variable-width encoding; 1–4 bytes; default for web	😀 = F0 9F 98 80
`Surrogate Pair`	Two UTF-16 values that together encode one supplementary code point	D83D + DE00 = 😀
`Grapheme Cluster`	One visible character as perceived by a human	👩‍💻 = 3 code points
`ZWJ`	Zero Width Joiner – invisible glue between emoji	U+200D

🎯

The short summary: Unicode gives every character a unique number (code point). Encodings like UTF-8 decide how to store those numbers as bytes. The BMP holds most everyday text; emoji and historic scripts live in supplementary planes. UTF-16 uses surrogate pairs to reach those higher planes. And what you see as one character might be several code points fused together.