Lesson 2

Percent-Encoding Rules

How `%XX` represents bytes and which characters must be encoded.

Percent-encoding writes a % followed by exactly two hexadecimal digits for each escaped octet:

byte 32 (space)  → %20
byte 38 (&)       → %26
byte 231 (decimal) → hex E7 → %E7   (meaning depends on decoding context!)

Letters A–F may appear in upper or lower case (%e7 ≡ %E7). RFC 3986 treats the two forms as equivalent and recommends upper-case when producing escapes, so emit upper-case consistently, but make sure parsers accept either case.
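The byte-to-escape mapping above can be sketched in a few lines of Python. The helper names `escape_byte` and `unescape_byte` are hypothetical, chosen only for this illustration; they are not part of any library.

```python
import string


def escape_byte(b: int) -> str:
    """Return the %XX escape for one octet (0-255), using upper-case hex."""
    if not 0 <= b <= 255:
        raise ValueError("one escape encodes exactly one byte")
    return f"%{b:02X}"


def unescape_byte(escape: str) -> int:
    """Parse one %XX escape, accepting upper- or lower-case hex digits."""
    if (
        len(escape) != 3
        or escape[0] != "%"
        or any(c not in string.hexdigits for c in escape[1:])
    ):
        raise ValueError(f"not a %XX escape: {escape!r}")
    return int(escape[1:], 16)


print(escape_byte(32), escape_byte(38), escape_byte(231))
print(unescape_byte("%e7") == unescape_byte("%E7"))
```

The explicit hex-digit check is deliberate: `int(..., 16)` on its own also tolerates signs and surrounding whitespace, which a percent-escape must not contain.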

One escape = one byte (not “one UTF-16 code unit”)

When you encode Unicode strings, encode the octets of UTF-8 unless a scheme explicitly dictates otherwise:

π (U+03C0)

UTF-8 bytes: CF 80  → %CF%80
Misleading mental model (wrong): encode the code point digits “03C0” in hex
Correct model: serialize to bytes first, escape each problematic byte as %HH
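As a minimal sketch of the "bytes first" model, the snippet below serialises π to UTF-8 by hand and compares the result with `urllib.parse.quote` from the Python standard library (which encodes as UTF-8 by default). Passing `safe=""` is a choice made for this example so that nothing beyond the unreserved characters is left literal.

```python
from urllib.parse import quote

text = "π"  # U+03C0

# Step 1: serialise the string to bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: escape each byte as %HH.
manual = "".join(f"%{b:02X}" for b in utf8_bytes)

# The standard library performs the same two steps.
library = quote(text, safe="")

print(utf8_bytes.hex())  # cf80
print(manual)            # %CF%80
print(library)           # %CF%80
```

Note that the code point digits 03C0 never appear in the output; only the two UTF-8 bytes do.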

Which bytes need escaping?

Rough student-level checklist:

Bytes whose character doubles as a URI delimiter (?, &, #, /, :, @, [, ]) when that byte is part of opaque data rather than punctuation.
Bytes outside the ASCII range (0x80 and above), which should always be escaped so that the URI survives ASCII-only tooling.
Control bytes, spaces, DEL, and other whitespace when they are literal data inside parameters.

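RFC 3986 defines the generic rules and normalisation guidance; in particular, its unreserved characters (A–Z, a–z, 0–9, -, ., _, ~) never need escaping. Frameworks may be stricter (for example, rejecting a stray % that is not followed by two hex digits).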
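One conservative way to apply the checklist to a single opaque value is to escape everything except the unreserved characters. The sketch below does this with `urllib.parse.quote` and `safe=""`; the wrapper name `encode_component` is hypothetical, and the behaviour for `~` assumes Python 3.7 or later.

```python
from urllib.parse import quote


def encode_component(value: str) -> str:
    """Escape every byte except the RFC 3986 unreserved characters."""
    return quote(value, safe="")


# Delimiters, a space, and a non-ASCII character used as data.
print(encode_component("a&b=c d/é"))  # a%26b%3Dc%20d%2F%C3%A9

# Control bytes and DEL.
print(encode_component("\n\x7f"))     # %0A%7F

# Unreserved characters pass through untouched.
print(encode_component("Az09-._~"))   # Az09-._~
```

This is deliberately stricter than necessary for some components (a path may legally contain a literal `:` or `@`), but it is a safe default when the value is data rather than URI structure.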

Invalid escapes

Incomplete or non-hexadecimal sequences are malformed and confuse parsers:

%           → malformed (no digits)
%A          → malformed (needs two hex nibbles)
%GG         → invalid hex digits

Good libraries validate or reject malformed input instead of silently “best guessing.”
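A strict decoder that rejects the malformed cases above might look like the following sketch. The function name `strict_unquote_to_bytes` is hypothetical. By contrast, Python's own `urllib.parse.unquote` is lenient and passes malformed escapes through unchanged.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def strict_unquote_to_bytes(s: str) -> bytes:
    """Decode percent-escapes to bytes, raising ValueError on any malformed escape."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "%":
            chunk = s[i:i + 3]
            if not _ESCAPE.fullmatch(chunk):
                raise ValueError(f"malformed percent-escape at index {i}: {chunk!r}")
            out.append(int(chunk[1:], 16))
            i += 3
        else:
            # Literal characters are carried over as their UTF-8 bytes.
            out.extend(s[i].encode("utf-8"))
            i += 1
    return bytes(out)


print(strict_unquote_to_bytes("%CF%80").decode("utf-8"))  # π

for bad in ["%", "%A", "%GG"]:
    try:
        strict_unquote_to_bytes(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Returning bytes rather than a string keeps the decoder honest about the "one escape = one byte" rule; the caller decides which character encoding applies.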

Over-encoding and double encoding

Applying percent-encoding twice is a recurring bug:

literal value: 100%

first pass:    100%25     (literal percent becomes %25)
mistaken pass: 100%2525   (oops: the % of %25 was escaped a second time)

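If your server receives 100%2525 and decodes it once, it sees 100%25 rather than the intended 100%. Decoding a second time to compensate is not a fix, because any value that legitimately contains the text %25 would then be corrupted into a bare %. Decide once whether a subsystem expects encoded or decoded text and keep that invariant clear.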
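The double-encoding bug is easy to reproduce with the standard library, because `urllib.parse.quote` does not detect input that is already escaped. The following is a small demonstration, not a recommended fix; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

literal = "100%"

once = quote(literal, safe="")   # correct: encode exactly once
twice = quote(once, safe="")     # bug: encoding an already-encoded string

print(once)    # 100%25
print(twice)   # 100%2525

# A single decode of the double-encoded form does not recover the literal.
print(unquote(twice))            # 100%25

# Decoding twice "fixes" this case but corrupts data containing the text "%25".
user_text = "%25"
wire = quote(user_text, safe="")   # %2525
print(unquote(unquote(wire)))      # %   (the original "%25" is lost)
```

The second half shows why blanket double-decoding is unsafe: a correctly encoded value and a double-encoded value can be indistinguishable on the wire.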

Normalisation tips

Canonicalisation includes choices such as:

Upper-case vs lower-case hexadecimal (both valid).
Choosing between encoding a punctuation character vs leaving it literal when legally unambiguous inside a component.
Normalising Unicode to NFC before encoding as UTF-8 (optional, but it prevents the same visible text from having several byte spellings that different servers treat as distinct).

Consistency across your gateways (CDN vs app vs database) avoids subtle mismatches during signature validation or caching.
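Two of these choices, NFC normalisation and upper-case hex digits, can be sketched with the standard library. The helper names `canonical_component` and `uppercase_escapes` are hypothetical, and whether NFC is appropriate depends on your application.

```python
import re
import unicodedata
from urllib.parse import quote


def canonical_component(value: str) -> str:
    """Normalise to NFC, then percent-encode the UTF-8 bytes."""
    return quote(unicodedata.normalize("NFC", value), safe="")


def uppercase_escapes(encoded: str) -> str:
    """Upper-case the hex digits of each %XX escape, leaving other text alone."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), encoded)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# Without normalisation the two spellings encode differently.
print(quote(composed, safe=""), quote(decomposed, safe=""))  # %C3%A9 e%CC%81

# With NFC they converge on one encoding.
print(canonical_component(composed) == canonical_component(decomposed))  # True

print(uppercase_escapes("caf%c3%a9"))  # caf%C3%A9
```

Applying the same canonical form at every hop (CDN, app, database) is what keeps signatures and cache keys comparable byte for byte.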
