RFC 3986: Web Encoding Hell
January 30th, 2013
Have you ever had problems with %20 versus + as the encoding for spaces when using external HTTP APIs? We have run into this problem multiple times, and so has our own REST API. But don't be afraid: there's a reason for it, and there are plenty of solutions out there.
So, what is RFC 3986?
RFC 3986 is the URI (Uniform Resource Identifier) syntax document. It describes in great detail everything you will ever want to know about how URIs should be written and how they should be read.
But, if there's a detailed specification, what's the problem with it?
RFC 3986 does not give + any special meaning as a space; that convention comes from the application/x-www-form-urlencoded format that web browsers use when they submit HTML form data. There, spaces are encoded as + instead of %20, while every other unsafe character is still encoded using Percent Encoding. As this practice became more and more popular, URL encoding libraries for many programming languages followed the same approach, often as their default, leading to the inconsistent state we are in today: some encoders emit +, others emit %20, and not every decoder accepts both.
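To make the mismatch concrete, here is a small sketch using Python's standard urllib.parse module. The sample string is made up for illustration, and other languages' libraries may behave differently; the point is only that the two encoders disagree about spaces and that a decoder expecting one style can silently mangle the other.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Hypothetical sample value containing both a space and a literal plus sign.
text = "hello world+1"

# RFC 3986 style: the space becomes %20 and the literal + becomes %2B.
percent_encoded = quote(text, safe="")   # "hello%20world%2B1"

# Form style (application/x-www-form-urlencoded): the space becomes +.
form_encoded = quote_plus(text)          # "hello+world%2B1"

# Decoding with the matching function round-trips correctly.
assert unquote(percent_encoded) == text
assert unquote_plus(form_encoded) == text

# Decoding form-style data with a strict percent-decoder leaves the + in
# place, so the space is lost.
mismatched = unquote(form_encoded)       # "hello+world+1"
```

When both sides of an API agree on one style, the problem disappears; trouble starts when the client encodes with one function and the server decodes with the other.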
Percent Encoding: RFC 3986 - Section 2.1
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value. For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP).
This mechanism lets us safely include any kind of content in a URI. But when the standard is not followed exactly, inconsistencies appear and things eventually stop working.
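The mechanism quoted above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real library: the function name percent_encode is made up for this example, and it assumes the text is first converted to UTF-8 octets and that only the RFC 3986 unreserved characters (letters, digits, "-", ".", "_", "~") are left as they are.

```python
import string

# Unreserved characters per RFC 3986 Section 2.3; everything else is encoded.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Percent-encode every octet that is not an unreserved character.

    Hypothetical helper for illustration; assumes UTF-8 for non-ASCII text.
    """
    pieces = []
    for octet in text.encode("utf-8"):
        char = chr(octet)
        if char in UNRESERVED:
            pieces.append(char)
        else:
            # "%" followed by the two hexadecimal digits of the octet's value.
            pieces.append(f"%{octet:02X}")
    return "".join(pieces)

space_example = percent_encode("a b")    # "a%20b"
plus_example = percent_encode("1+1")     # "1%2B1"
```

Note that under this scheme a space is always %20 and a literal plus sign is always %2B, so there is no ambiguity about what + means.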