RFC 3986: Web Encoding Hell
January 30, 2013 | David Litvak
Have you ever had problems with %20 vs + space encoding when using external HTTP APIs? We’ve faced this problem multiple times, including in our own REST API. But don’t be afraid: there’s a reason for it, and there are plenty of solutions out there.
So, what is RFC 3986?
RFC 3986 is the URI (Uniform Resource Identifier) syntax specification. In it you can find, in great detail, everything you will ever want to know about how URIs should be written and how they should be read.
But, if there’s a detailed specification, what’s the problem with it?
Web browsers do not always follow it strictly. When submitting form data, they use the application/x-www-form-urlencoded format, which encodes spaces as + instead of %20, while other reserved characters are still encoded using Percent Encoding.
As this practice became more and more popular, URL encoding libraries for the different programming languages followed the same approach, leading to the inconsistent state we are in today.
Percent Encoding: RFC 3986 – Section 2.1
A percent-encoding mechanism is used to represent a data octet in a component when that octet’s corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character “%” followed by the two hexadecimal digits representing that octet’s numeric value. For example, “%20” is the percent-encoding for the binary octet “00100000” (ABNF: %x20), which in US-ASCII corresponds to the space character (SP).
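As a small illustration of the difference, here is a minimal Python 3 sketch using the standard urllib.parse module. The sample string is made up for the example; quote with safe="" stands in for strict Percent-Encoding and quote_plus for the form-style variant.

```python
from urllib.parse import quote, quote_plus

# Hypothetical sample value, chosen only to show spaces and a reserved character.
text = "hello world & more"

# Strict percent-encoding: a space becomes %20.
percent_encoded = quote(text, safe="")

# Form-style encoding: identical, except that a space becomes "+".
plus_encoded = quote_plus(text)

print(percent_encoded)  # hello%20world%20%26%20more
print(plus_encoded)     # hello+world+%26+more
```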
Percent-Encoding allows us to include any kind of content in a URI safely. But when the standard is not followed to the letter, inconsistencies creep in and things eventually stop working.
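The kind of inconsistency meant here can be sketched in a few lines of Python 3. This is only an illustration with made-up values: it assumes one side of the conversation uses the form-style functions (quote_plus, unquote_plus) while the other uses the strict ones (quote, unquote).

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Case 1: the sender plus-encodes, the receiver decodes strictly.
sent = quote_plus("hello world")   # "hello+world"
strict_decoded = unquote(sent)     # the "+" survives as a literal plus sign

# Case 2: a literal "+" was sent unencoded and the receiver
# applies form-style decoding, turning it into a space.
form_decoded = unquote_plus("1+1=2")

# Case 3: both sides agree on strict percent-encoding, so the value round-trips.
round_trip = unquote(quote("1+1=2", safe=""))

print(strict_decoded)  # hello+world
print(form_decoded)    # 1 1=2
print(round_trip)      # 1+1=2
```

Only the third case returns the original value, which is why both ends need to agree on one scheme.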
The OAuth Problem
OAuth is a widely used token-based protocol for authentication; version 1.0 is defined in RFC 5849.
As the RFC specifies, OAuth uses strict Percent-Encoding when signing requests. A library that uses Plus-Encoding builds a different string to sign, so the signature it produces does not match the one the server computes and the request is rejected.
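To make the failure concrete, here is a rough Python 3 sketch of how an OAuth 1.0 signature base string is assembled, loosely following RFC 5849 sections 3.4.1 and 3.6. The helper names (oauth_encode, base_string), the URL and the parameter values are all hypothetical, and the sketch skips details such as query-string parameters and repeated parameter names, so treat it as an illustration rather than a complete implementation.

```python
from urllib.parse import quote, quote_plus


def oauth_encode(value: str) -> str:
    # RFC 5849 section 3.6: only letters, digits, "-", ".", "_" and "~"
    # stay unencoded; everything else becomes %XX with uppercase hex.
    return quote(value, safe="~")


def base_string(method: str, url: str, params: dict, encode=oauth_encode) -> str:
    # Encode names and values, sort them, and join them as name=value pairs.
    pairs = sorted((encode(k), encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    # The method, the URL and the parameter string are encoded once more
    # and joined with "&" to form the string that gets signed.
    return "&".join([method.upper(), encode(url), encode(normalized)])


# Hypothetical request, for illustration only.
params = {"status": "hello world", "oauth_nonce": "abc"}
url = "https://api.example.com/1/statuses"

strict_base = base_string("POST", url, params)
plus_base = base_string("POST", url, params, encode=quote_plus)

print(strict_base)
print(plus_base)
```

The two base strings differ only in how the space was encoded (%2520 versus %2B after the second encoding pass), but since the signature is computed over this string, any difference is enough to make it invalid.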
Programming Languages and Encoding
Over the last year we’ve been working hard on releasing our new REST API and the libraries to use it from multiple programming languages. As our API handles its authentication through OAuth, we had to face this encoding drama several times.
Dealing with Python
As unbelievable as it sounds, Python’s standard library urllib has an unresolved bug that will not be fixed in any future 2.7.x release. A fix is scheduled for the next 3.x version, but it still hasn’t been approved.
To work around this, we had to roll out our own fork of the popular HTTP library requests, which handles this encoding problem gracefully.
Dealing with Ruby
When writing our Ruby library we ran into a huge amount of trouble: OAuth libraries that did not follow the RFCs, and URL encoding that was broken in the standard library too.
After writing around 10 versions of the library and trying out every OAuth library out there, we found Signet, a really good OAuth 1.0/2.0 library, which came to our rescue. But we still had problems with encoding: Ruby’s URI module doesn’t handle Percent-Encoding the way it should. So our quest for a Percent-Encoding aware URI module began. After trying a bunch of them, we came across two promising ones, Faraday::URI and Addressable::URI, and finally settled on Addressable::URI because its default behavior is to Percent-Encode, which simplified the code greatly and solved our problem.
Have you ever faced these problems in your favorite language? Please tell us about it in the comments section.