Load balancing multiple CDNs or how jsDelivr works
November 8, 2013
Don’t Know jsDelivr? You will.
If you’re not in the know, jsDelivr is a free and public CDN that offers hosting for developers at no cost. It’s a lot like Google Hosted Libraries, but with more projects and freedom. Anyone can submit a library for hosting on this CDN. It’s fast with multiple points of presence around the globe — 81, currently.
But how is this possible?
The combined resources of two CDN providers and 14 VPS servers make up jsDelivr’s 81 locations. For a full list of providers, read the about page.
When I built jsDelivr, my main concerns were speed and uptime. I feel responsible for every website that uses it. I want everything to work, even in worst-case scenarios, so I set out to eliminate SPOFs (single points of failure). To that end, jsDelivr was developed through many hours of research, configuring and tweaking.
Here’s a diagram to illustrate how jsDelivr works:
Each file originates in our GitHub repository. All committed files are automatically uploaded to my origin server, a small VPS running Nginx. To do this I use a service called DeployHQ, triggered by a webhook I created in my GitHub repo.
Pull requests for new files are automatically uploaded via SFTP to my origin. Requests to the origin are made by MaxCDN.
The first time MaxCDN requests a file, it’s fetched from the origin VPS. It’s cached and not requested again, unless I perform a manual purge. If the origin goes down, it’s not an issue because all files are cached by MaxCDN.
The rest of the servers, including a backup pull zone on another CDN, use MaxCDN as their origin and cache the files. If MaxCDN goes offline, the other servers have my files cached.
For the purposes of visualization, I have all the VPS servers grouped together in the diagram. However, each VPS is an independent entity, has Nginx installed in a transparent proxy configuration and maintains a separate file cache. There is no synchronization or interaction between them. Each requested file is automatically downloaded from MaxCDN and cached.
For monitoring, I install Munin-Node on each server. I use the default Munin-Node config, remove a few items that are outside the scope of this discussion, and add monitoring metrics for Nginx. For Munin-Master I use a free service called HostedMunin. It’s very useful and eliminates the need to manage VPSs and services.
If a VPS goes down — and they do — the system automatically removes it from the pool and sends the users to the server that’s next in line for performance. But more on that later.
For nameservers, jsDelivr uses Akamai DNS, which eliminates the worry of the nameservers going down. It also provides excellent performance, since it runs on Akamai’s huge infrastructure.
jsDelivr uses Cedexis to load balance. Cedexis offers DNS-based load balancing according to your own needs; it allows you to load any external data and create your own load balancing algorithm. But in the case of jsDelivr, we are going to use performance and uptime as our criteria.
Cedexis gathers information on all major CDNs and cloud hosting providers. They claim they do over 1.3 billion measurements per day.
They have separate information for HTTP, HTTPS and mobile on:
HTTP connect time
HTTP response time
These public measurements are known as community data and any Cedexis user can use them in their own DNS load balancing algorithms and applications. jsDelivr uses the community data generated for MaxCDN.
In addition, you can collect private RUM (real user measurements) data, which you can access through the Cedexis API. This is very useful if you have private servers and want to use them for your applications.
First, create a private “platform” for each server/DC/CDN you want to monitor. During configuration, you’ll enter URLs that direct to test files the users will download to collect performance data. This is how it looks:
When creating a platform, you can also perform uptime checks. These checks will be administered by Cedexis and the data can be accessed from your applications. This feature is called Cedexis Sonar.
For my use of jsDelivr, I’ve created 14 custom platforms to monitor the performance and uptime of each of my VPS servers.
The data is stored in your account. You can access it in the same way as community data:
Now that we have our data, we can start load balancing by creating a “DNS application”.
By default Cedexis includes three installed applications: “Optimal Round Trip Time”, “Round Robin” and “Static Routing”. Enter your platforms into the application, and you’ll have a basic load balancing setup. It’s good enough for basic load balancing, but not for jsDelivr.
It’s worth mentioning at this point that all DNS applications are written in PHP. When you create a custom application you will be required to enter a fallback CNAME or IP and upload the actual PHP code. Fallback is very important. All users will be sent to it if the application fails to output a response. An application can fail because of syntax or logic errors, so be careful when writing your algorithms.
If you want to take a look at actual application code, check out the Cedexis developer page.
Here’s how load balancing for jsDelivr works and what it can do for you. First, I register all providers or platforms and enter the correct CNAME for each. This CNAME will be given to the user later.
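jsDelivr uses a penalty system for some providers. A penalty is a multiplier applied to a provider’s measured latency, and the default value for all providers is 1.0. For example, if a provider has 100ms latency to Canada but its value is adjusted to 0.5, the performance data used for this provider is 50ms. Values below 1.0 therefore favor a provider, and values above 1.0 penalize it. You can add different penalties per country if you want.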
This is very useful in many cases. For example, I can add a small penalty to CDN.NET globally to give a small boost to local VPS servers. I also adjust MaxCDN to be a bit faster for the USA since they have the best infrastructure and I want to take advantage of that.
Keep in mind that not all data is correct and not all data is equal. Since MaxCDN has community data with 1.3 billion measurements per day, it’s hard to compare it to a VPS in New York that gets 100,000 measurements per day. If 90,000 of those measurements come from people in NY and 10,000 from other locations, the New York VPS will appear to have better performance for the US than MaxCDN, which is not correct. I artificially raise MaxCDN’s performance to compensate for these issues.
public $providers = array(
    'cdn.net_b' => array(
        'cname' => '531151672.r.worldcdn.net',
        'penalty' => 1.3
    ),
    'maxcdn' => array(
        'cname' => 'jsdelivr3.dak.netdna-cdn.com',
        'country_penalties' => array(
            'US' => 0.6
        )
    ),
    'jack-it' => array(
        'cname' => 'jack-it.jsdelivr.net'
    ),
    ...
);
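To make the arithmetic concrete, here is a minimal sketch of how such a multiplier could be looked up and applied. It is not the actual jsDelivr application code: the helper name `effective_penalty` is hypothetical, and the rule that a country penalty replaces the global one (rather than stacking with it) is an assumption.

```php
<?php
// Hypothetical helper: returns the multiplier applied to a provider's RTT.
// Assumption: a country-specific penalty, when present, replaces the global
// one; providers with neither use the default of 1.0 (no adjustment).
function effective_penalty(array $provider, string $country): float
{
    if (isset($provider['country_penalties'][$country])) {
        return (float) $provider['country_penalties'][$country];
    }
    return (float) ($provider['penalty'] ?? 1.0);
}

// Same values as the provider list in the article.
$providers = array(
    'cdn.net_b' => array('cname' => '531151672.r.worldcdn.net', 'penalty' => 1.3),
    'maxcdn'    => array(
        'cname' => 'jsdelivr3.dak.netdna-cdn.com',
        'country_penalties' => array('US' => 0.6),
    ),
    'jack-it'   => array('cname' => 'jack-it.jsdelivr.net'),
);

// A measured RTT of 100 ms for MaxCDN is treated as roughly 60 ms for US users
// and stays at 100 ms for users in any other country.
$adjusted_us = 100 * effective_penalty($providers['maxcdn'], 'US');
$adjusted_ca = 100 * effective_penalty($providers['maxcdn'], 'CA');
```

With these numbers, MaxCDN gets a boost only for US visitors, while CDN.NET is handicapped by 30% everywhere.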
jsDelivr registers static rules with country overrides. This means that it doesn’t matter what the performance is, all users from predefined countries will always connect to a pre-selected provider.
For example, I know that leap-ua VPS has better performance in the Ukraine than any other CDN or VPS. For this reason I created the following rule:
public $country_overrides = array( 'UA' => 'leap-ua' );
You can also create rules with overrides based on ASNs if you want.
public $asn_overrides = array( '36331' => 'maxcdn' );
Now I define a few variables:
// The thresholds (%) below which we consider a CDN unavailable
private $availability_threshold = 90;
private $sonar_threshold = 75;
private $ttl = 20;

public $reasons = array(
    'A', // RTT
    'B', // Country override
    'C', // ASN override
    'D', // Single available candidate
    'E', // None available - random selection
    'F', // No RTT data for available candidates - random selection
);
$availability_threshold applies to the uptime data collected from live users, taken from either the community data pool or your own private measurements. If fewer than 90% of requests per hour succeed, the provider is considered down.
$sonar_threshold applies to the uptime data collected by the Cedexis Sonar system. In my case it takes measurements every minute for all providers. This data is used as a backup for $availability_threshold. If, for any reason, the live user data is wrong (because of not enough measurements or other reasons), this ensures we don’t respond with offline providers.
$ttl is the DNS TTL we send to the user. I use a TTL of 20 seconds so all users receive updated information.
$reasons is self-explanatory. It’s used for debugging: each code records which action was taken and why.
1. Declare data inputs. This is where we get our data. jsDelivr uses the following:
HTTP Response Time
EDNS based Country and ASN
Most of you should be familiar with this data, except maybe EDNS. DNS servers (such as Cedexis) can’t see actual user IP addresses. They only know the IP of the DNS resolver the user used. This isn’t a problem if the resolver belongs to a local ISP. But if someone from Brazil used Google DNS at 8.8.8.8, Cedexis would mark the user as based in the USA and serve the wrong data. This is where EDNS is useful. Most public DNS resolvers, including Google’s 8.8.8.8, support EDNS. If a resolver does, it sends the user’s IP address (in practice a truncated client subnet) along with the rest of the query. This way CDNs, and Cedexis, know where the user really is and can route them to the correct destination.
jsDelivr checks if EDNS is enabled and reads the correct IP address. If EDNS isn’t available, it falls back to reading the IP address of the DNS resolver. (Hopefully the user is using a resolver close to them.)
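The fallback described above can be sketched in a few lines. This is an illustration only: the `$request` array shape and the function name `client_location` are hypothetical and do not reflect the real Cedexis API, and the ASN 64512 is a private-use placeholder.

```php
<?php
// Hypothetical request shape (not the Cedexis API):
//   'edns'     => geo data derived from the EDNS client subnet, or null
//   'resolver' => geo data derived from the DNS resolver's own IP address
function client_location(array $request): array
{
    // Prefer the EDNS-derived data; fall back to the resolver when it is missing.
    $source = $request['edns'] ?? $request['resolver'];

    return array(
        'country'   => $source['country'],
        'asn'       => $source['asn'],
        'from_edns' => isset($request['edns']),
    );
}

// A user in Brazil querying through a US-based public resolver.
$with_edns = array(
    'edns'     => array('country' => 'BR', 'asn' => '64512'),
    'resolver' => array('country' => 'US', 'asn' => '15169'),
);
// The same user when the resolver does not forward any client subnet.
$without_edns = array(
    'edns'     => null,
    'resolver' => array('country' => 'US', 'asn' => '15169'),
);

$good  = client_location($with_edns);    // country 'BR', from_edns true
$worse = client_location($without_edns); // country 'US', from_edns false
```

The second case is exactly the misrouting problem: without EDNS, the Brazilian user is treated as if they were in the US.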
2. Loop through the declared providers and remove providers that:
Don’t have a correct CNAME declared
Don’t meet the User Availability threshold
Don’t meet the Sonar Availability threshold
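The three checks above could look roughly like the following. This is a sketch under assumptions, not the production code: `filter_candidates` is a hypothetical name, the `$availability` and `$sonar` arrays stand in for data that would really come from Cedexis, and treating a provider with no data as unavailable is my own assumption.

```php
<?php
// $availability and $sonar map provider aliases to percentages (0-100).
// Assumption: a provider with no data at all is treated as unavailable.
function filter_candidates(
    array $providers,
    array $availability,
    array $sonar,
    float $availability_threshold = 90.0,
    float $sonar_threshold = 75.0
): array {
    $candidates = array();
    foreach ($providers as $alias => $settings) {
        if (empty($settings['cname'])) {
            continue; // no usable CNAME declared
        }
        if (($availability[$alias] ?? 0) < $availability_threshold) {
            continue; // too many failed real-user requests
        }
        if (($sonar[$alias] ?? 0) < $sonar_threshold) {
            continue; // failing the Sonar uptime checks
        }
        $candidates[$alias] = $settings;
    }
    return $candidates;
}

$providers = array(
    'maxcdn'    => array('cname' => 'jsdelivr3.dak.netdna-cdn.com'),
    'cdn.net_b' => array('cname' => '531151672.r.worldcdn.net'),
    'jack-it'   => array('cname' => 'jack-it.jsdelivr.net'),
    'broken'    => array('cname' => ''), // made-up entry with a missing CNAME
);
// Made-up measurements for illustration.
$availability = array('maxcdn' => 99.8, 'cdn.net_b' => 85.0, 'jack-it' => 97.0, 'broken' => 100.0);
$sonar        = array('maxcdn' => 100.0, 'cdn.net_b' => 100.0, 'jack-it' => 50.0, 'broken' => 100.0);

$candidates = filter_candidates($providers, $availability, $sonar);
// Only 'maxcdn' survives: 'cdn.net_b' fails the user availability check,
// 'jack-it' fails Sonar, and 'broken' has no CNAME.
```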
3. The providers (or candidates) that meet all the above criteria are then used to match any existing static rules like Country or ASN overrides. It’s done at this stage because we don’t want the overrides to break our failover system. Static rules apply ONLY for providers that are currently online.
The user information is checked against the declared country and ASNs. If one of them matches, the CNAME associated with the provider is reported and the process dies.
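Here is a small sketch of why the order matters: because the overrides are matched against the already filtered candidates, an override that points at an offline provider is simply ignored. The function name `match_override`, the CNAME `leap-ua.example.net` and the choice to check the country before the ASN are all assumptions made for this example.

```php
<?php
// Returns the CNAME and reason code for a matching static rule, or null.
// Reason codes follow the $reasons list: 'B' = country, 'C' = ASN override.
function match_override(
    array $candidates,
    array $country_overrides,
    array $asn_overrides,
    string $country,
    string $asn
): ?array {
    // Assumption: the country rule is checked before the ASN rule.
    $rules = array(
        'B' => $country_overrides[$country] ?? null,
        'C' => $asn_overrides[$asn] ?? null,
    );
    foreach ($rules as $reason => $alias) {
        // An override only counts if its provider is still a candidate (online).
        if ($alias !== null && isset($candidates[$alias])) {
            return array('cname' => $candidates[$alias]['cname'], 'reason' => $reason);
        }
    }
    return null;
}

$country_overrides = array('UA' => 'leap-ua');
$asn_overrides     = array('36331' => 'maxcdn');

$online = array(
    'maxcdn'  => array('cname' => 'jsdelivr3.dak.netdna-cdn.com'),
    'leap-ua' => array('cname' => 'leap-ua.example.net'), // placeholder CNAME
);
$leap_ua_down = array(
    'maxcdn' => array('cname' => 'jsdelivr3.dak.netdna-cdn.com'),
);

$ua_user   = match_override($online, $country_overrides, $asn_overrides, 'UA', '0');
$asn_user  = match_override($online, $country_overrides, $asn_overrides, 'US', '36331');
$ua_failed = match_override($leap_ua_down, $country_overrides, $asn_overrides, 'UA', '0');
```

In the last call the Ukrainian override is skipped because leap-ua is not among the candidates, so the request continues to the normal performance-based selection.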
4. At this point, we want to make sure we have enough information to work with. We count how many providers are left after filtering by the availability thresholds. If only one provider is left, its CNAME is returned and the process dies. If we have zero providers left, we select a random one. However, this should be VERY rare.
5. If none of the static rules matched, we loop through the providers, get their performance data and apply any penalties we might have declared for them. It’s also important to note that this is the step that runs for almost every request. We take into account the user’s country and any other information. Then we asort our array of providers based on their RTT performance and output the fastest.
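A rough sketch of this check follows. The function name `resolve_by_count` is hypothetical, and picking the random provider from the full declared list is an assumption, since the text does not say which pool is used.

```php
<?php
// Returns a decision when the candidate count alone settles the answer,
// or null when two or more candidates remain and RTT must be compared.
// Reason codes follow the $reasons list: 'D' = single candidate, 'E' = random.
function resolve_by_count(array $candidates, array $all_providers): ?array
{
    if (count($candidates) === 1) {
        return array('alias' => array_keys($candidates)[0], 'reason' => 'D');
    }
    if (count($candidates) === 0) {
        // Assumption: with nothing available, pick at random from every
        // declared provider. $all_providers must not be empty.
        return array('alias' => array_rand($all_providers), 'reason' => 'E');
    }
    return null;
}

$all_providers = array(
    'maxcdn'    => array('cname' => 'jsdelivr3.dak.netdna-cdn.com'),
    'cdn.net_b' => array('cname' => '531151672.r.worldcdn.net'),
    'jack-it'   => array('cname' => 'jack-it.jsdelivr.net'),
);

$single = resolve_by_count(array('maxcdn' => $all_providers['maxcdn']), $all_providers);
$none   = resolve_by_count(array(), $all_providers);
$many   = resolve_by_count($all_providers, $all_providers);
```

Returning null for the common case keeps this step cheap: it only short-circuits the edge cases and hands everything else to the RTT comparison.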
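The final selection can be sketched like this. It is only an illustration: `fastest_provider` is a hypothetical name, the RTT figures are made up, and a country penalty is assumed to replace the global penalty instead of stacking with it.

```php
<?php
// Picks the candidate with the lowest penalty-adjusted RTT for a country.
// Returns null when no candidate has RTT data (reason 'F' in the $reasons
// list, where the caller would fall back to a random selection).
function fastest_provider(array $candidates, array $rtt, string $country): ?string
{
    $scores = array();
    foreach ($candidates as $alias => $settings) {
        if (!isset($rtt[$alias])) {
            continue; // no RTT data for this candidate
        }
        // Assumption: country penalty first, then global penalty, then 1.0.
        $penalty = $settings['country_penalties'][$country]
            ?? ($settings['penalty'] ?? 1.0);
        $scores[$alias] = $rtt[$alias] * $penalty;
    }
    if (empty($scores)) {
        return null;
    }
    asort($scores); // ascending by adjusted RTT, keys preserved
    return array_keys($scores)[0];
}

$candidates = array(
    'cdn.net_b' => array('cname' => '531151672.r.worldcdn.net', 'penalty' => 1.3),
    'maxcdn'    => array(
        'cname' => 'jsdelivr3.dak.netdna-cdn.com',
        'country_penalties' => array('US' => 0.6),
    ),
    'jack-it'   => array('cname' => 'jack-it.jsdelivr.net'),
);
// Made-up RTT measurements in milliseconds.
$rtt = array('cdn.net_b' => 50, 'maxcdn' => 80, 'jack-it' => 55);

$for_us = fastest_provider($candidates, $rtt, 'US'); // 'maxcdn' (80 * 0.6 beats 55)
$for_de = fastest_provider($candidates, $rtt, 'DE'); // 'jack-it' (55 beats 50 * 1.3)
```

Note that CDN.NET has the best raw RTT in this made-up data, yet never wins: its 1.3 penalty pushes it behind the VPS, which is exactly the kind of nudge the penalty system is meant to provide.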
What does this all mean? jsDelivr always selects the fastest possible provider for each individual user. Uptime is taken very seriously and all precautions are taken to make sure jsDelivr won’t go down.
If the origin VPS goes down, MaxCDN still has all the files cached.
If MaxCDN goes down, then Cedexis will take it out of rotation and load balance the users across VPSs and CDN.NET — both of which have all files cached.
A jsDelivr failure would require two enterprise CDNs to go down, plus 14 different hosting providers.
Nothing happens if Cedexis goes down. The CNAME of cdn.jsdelivr.net will change from the Cedexis hostname to MaxCDN’s.
And there you go. You have now learned exactly how jsDelivr works, how to load balance multiple CDNs yourself and what information to take into account.