Lesson 2
Percent-Encoding Rules
How `%XX` represents bytes and which characters must be encoded.
Percent-encoding writes a % followed by exactly two hexadecimal digits for each escaped octet:
byte 32 (space) → %20
byte 38 (&) → %26
byte 231 (decimal) → hex E7 → %E7 (meaning depends on decoding context!)
Letters A–F may appear in upper or lower case (%e7 ≡ %E7). Interoperability is better when tooling is consistent, but parsers should accept mixed case.
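The mapping above can be sketched in a few lines of Python. The helper name `escape_byte` is made up for this illustration; `unquote_to_bytes` is the standard-library decoder, used here only to show that both hex cases decode to the same octet.

```python
from urllib.parse import unquote_to_bytes


def escape_byte(b: int) -> str:
    """Return the %XX escape for one octet (hypothetical helper)."""
    if not 0 <= b <= 255:
        raise ValueError("an octet must be in the range 0..255")
    # Two upper-case hex digits, zero-padded.
    return f"%{b:02X}"


print(escape_byte(32))   # %20
print(escape_byte(38))   # %26
print(escape_byte(231))  # %E7

# Upper- and lower-case hex digits decode to the same byte.
same_byte = unquote_to_bytes("%e7") == unquote_to_bytes("%E7") == b"\xe7"
print(same_byte)  # True
```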
One escape = one byte (not “one UTF-16 code unit”)
When you encode Unicode strings, encode the octets of UTF-8 unless a scheme explicitly dictates otherwise:
π (U+03C0)
UTF-8 bytes: CF 80 → %CF%80
Misleading mental model (wrong): escape the code point's hex digits “03C0” directly, as if they were bytes (%03%C0)
Correct model: serialize to bytes first, escape each problematic byte as %HH
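A minimal sketch of the correct model, assuming UTF-8 and Python's standard library. The function name `percent_encode_utf8` is invented for this example, and it escapes every byte outside the RFC 3986 unreserved set, which is stricter than some components require.

```python
from urllib.parse import quote

# RFC 3986 "unreserved" characters, which never need escaping.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode_utf8(text: str) -> str:
    """Serialize to UTF-8 bytes first, then escape byte by byte."""
    out = []
    for b in text.encode("utf-8"):
        out.append(chr(b) if b in UNRESERVED else f"%{b:02X}")
    return "".join(out)


pi_bytes = " ".join(f"{b:02X}" for b in "π".encode("utf-8"))
print(pi_bytes)                  # CF 80
print(percent_encode_utf8("π"))  # %CF%80
print(quote("π"))                # %CF%80 (the standard library agrees)
```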
Which bytes need escaping?
Rough student-level checklist:
- Bytes whose graphic character would interact with URI separators (?, &, #, /, :, @, [, ]) when those bytes belong to opaque data rather than punctuation.
- Bytes outside ASCII, for broad compatibility.
- Control bytes, spaces, DEL, and ambiguous whitespace when they are literal data inside parameters.
RFC 3986 defines the generic rules and the normalisation behaviour; frameworks may tighten them (for example, by rejecting a stray % that is not followed by two hex digits).
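As a rough illustration of the checklist, Python's `urllib.parse.quote` lets the caller say which reserved characters stay literal through its `safe` parameter. The sample values below are invented for the example.

```python
from urllib.parse import quote

# Opaque data destined for a query parameter: escape every separator.
raw_value = "a&b=c #1"
encoded_value = quote(raw_value, safe="")
print(encoded_value)  # a%26b%3Dc%20%231

# A path: "/" is punctuation here, so it stays literal (safe="/" is the default).
encoded_path = quote("/docs/read me.txt")
print(encoded_path)  # /docs/read%20me.txt

# Control bytes such as DEL are always escaped.
encoded_del = quote("\x7f", safe="")
print(encoded_del)  # %7F
```

The same character (`/`, `&`, `=`) is escaped or left alone depending on whether it is data or punctuation in that component, which is why the `safe` argument is chosen per component rather than once for the whole URI.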
Invalid escapes
Incomplete sequences confuse parsers:
% → malformed (no digits)
%A → malformed (needs two hex nibbles)
%GG → invalid hex digits
Good libraries validate or reject malformed input instead of silently “best guessing.”
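One way to get strict behaviour in Python is to check for malformed escapes before decoding, since `urllib.parse.unquote` passes them through unchanged rather than raising. This is a sketch, not a complete validator; `strict_unquote_to_bytes` is a hypothetical wrapper name.

```python
import re
from urllib.parse import unquote, unquote_to_bytes

# A "%" that is NOT followed by exactly two hex digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def strict_unquote_to_bytes(text: str) -> bytes:
    """Decode percent-escapes, rejecting malformed ones (hypothetical helper)."""
    bad = _BAD_ESCAPE.search(text)
    if bad is not None:
        raise ValueError(f"malformed percent-escape at index {bad.start()}")
    return unquote_to_bytes(text)


for sample in ("%20", "%", "%A", "%GG"):
    try:
        print(sample, "->", strict_unquote_to_bytes(sample))
    except ValueError as exc:
        print(sample, "->", exc)

# The lenient standard decoder leaves the bad sequence in place.
lenient = unquote("%GG")
print(lenient)  # %GG
```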
Over-encoding and double encoding
Applying percent-encoding twice is a recurring bug:
literal value: 100%
first pass: 100%25 (literal percent becomes %25)
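mistaken pass: 100%2525 (oops: the % of the existing %25 escape was encoded again)
If your server receives 100%2525 and decodes once, it sees 100%25 instead of 100%, which is not what you intended for display data. Decoding a second time to compensate corrupts any value that legitimately contains text such as %25. Decide once whether a subsystem expects encoded or decoded text and keep that invariant clear.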
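The double-encoding bug is easy to reproduce with the standard library. The variable names below are illustrative only.

```python
from urllib.parse import quote, unquote

value = "100%"

once = quote(value, safe="")   # correct: encode exactly once
twice = quote(once, safe="")   # bug: encoding already-encoded text

print(once)   # 100%25
print(twice)  # 100%2525

# A single decode of the double-encoded form does not recover the value.
decoded_once = unquote(twice)
print(decoded_once)  # 100%25

# Only a second decode gets back to the original text.
decoded_twice = unquote(decoded_once)
print(decoded_twice)  # 100%
```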
Normalisation tips
Canonicalisation includes choices such as:
- Upper-case vs lower-case hexadecimal (both valid).
- Choosing between encoding a punctuation character vs leaving it literal when legally unambiguous inside a component.
- Normalising Unicode to NFC before encoding UTF-8 (optional, but it prevents equivalent spellings of the same text from producing different encoded forms on different servers).
Consistency across your gateways (CDN vs app vs database) avoids subtle mismatches during signature validation or caching.
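Two of these choices can be sketched in Python under simple assumptions: escapes are rewritten to upper-case hex, and text is normalised to NFC before encoding. `uppercase_escapes` is a hypothetical helper, and a real gateway would likely apply more rules than these.

```python
import re
import unicodedata
from urllib.parse import quote


def uppercase_escapes(encoded: str) -> str:
    """Rewrite every well-formed %xx escape with upper-case hex digits."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), encoded)


print(uppercase_escapes("caf%c3%a9"))  # caf%C3%A9

# Two spellings of the same visible character "é".
composed = "\u00e9"      # single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(quote(composed))    # %C3%A9
print(quote(decomposed))  # e%CC%81

# After NFC normalisation both spellings encode identically.
normalised = quote(unicodedata.normalize("NFC", decomposed))
print(normalised)  # %C3%A9
```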