How we improved our API response time by 95%
May 23rd, 2014

MaxCDN has always used a Perl-based system for provisioning zones to the various Points of Presence (POPs) throughout the cache network. That system started to creak as our client base grew: provisioning ran on a single thread and blocked on I/O operations. Creating a new node is primarily I/O bound (data and settings are set up on disk), with relatively light CPU activity.

On average, new zones took about 10 seconds to provision. Not bad, right -- why change? Unfortunately, because the requests were synchronous, they could pile up, one behind the other. In some cases a newly issued provisioning request could take up to 5 minutes to complete -- not a great experience for the user sitting at the other end of the control panel. Seeing the delay, a user might suspect an error, refresh the page, enter the zone details again, and kick off another provisioning request that also got stuck in the queue. You can see where this is going.

We decided to move provisioning to an API-driven process and had to decide among a few implementation languages. We chose NodeJS for two reasons:
- NodeJS is asynchronous-by-default, which suited the problem domain. Provisioning is more like “start the job, let me know when you’re done” than a traditional C-style program that’s CPU-bound and needs low-level efficiency.
- NodeJS includes a built-in HTTP server, so exposing the API was trivial.
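To make the "start the job, let me know when you're done" model concrete, here is a minimal sketch that contrasts a serial queue with concurrent jobs. It is an illustration only, not our provisioning code: `provisionZone`, `provisionSerially`, `provisionConcurrently`, and `timeIt` are hypothetical names, and the disk I/O is simulated with a timer.

```javascript
// Hypothetical stand-in for provisioning one zone. The timer simulates
// disk I/O; no real files are written.
function provisionZone(zoneName, ioMillis) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ zone: zoneName, status: 'provisioned' }), ioMillis);
  });
}

// Serial model: each request waits for the one ahead of it to finish,
// so total time grows with the length of the queue.
async function provisionSerially(zoneNames, ioMillis) {
  const results = [];
  for (const name of zoneNames) {
    results.push(await provisionZone(name, ioMillis));
  }
  return results;
}

// Asynchronous model: start every job, then wait for all of them together.
// The simulated I/O waits overlap instead of adding up.
function provisionConcurrently(zoneNames, ioMillis) {
  return Promise.all(zoneNames.map((name) => provisionZone(name, ioMillis)));
}

// Small helper that reports how long a batch took.
async function timeIt(label, fn) {
  const start = Date.now();
  const results = await fn();
  const elapsedMs = Date.now() - start;
  console.log(`${label}: ${results.length} zones in ~${elapsedMs} ms`);
  return { results, elapsedMs };
}
```

Calling `timeIt('serial', () => provisionSerially(['a', 'b', 'c'], 50))` and then the same with `provisionConcurrently` should show the serial run taking roughly three times as long, since the waits no longer queue behind each other.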
Architecture Details
Here are a few details on the new architecture, and tips on how to apply similar changes to your own system.
- Be asynchronous. The major gains came from no longer blocking on filesystem I/O while new requests arrived. Again, choosing NodeJS meant we had this architecture strategy out of the box. Having several simultaneous I/O operations queued lets the operating system, rather than the programmer, figure out how to allocate resources (its speciality). Fire off the requests and let the OS sort ‘em out.
- The fastest code is no code. As we rebuilt the API, we noticed the previous provisioning system ran a configuration check against every zone on a server, which could take anywhere from 1 to 15 seconds. The new API checks only the configuration of the zone being provisioned, which usually completes in under 250ms. When you redesign a legacy system, question the assumptions that may no longer apply.
- Be even more asynchronous. The original API performed a synchronous Nginx reload after provisioning a zone, which often took 30 seconds or longer. While important, this step shouldn’t block the response to the user (or API) that a new zone has been created, or block subsequent requests to adjust the zone. With the new API, an independent worker reloads Nginx configurations based on zone modifications. It’s like ordering a product online: the purchase process doesn’t pause until the product has shipped. The site says the order has been created, and you can still cancel or modify shipping information. Meanwhile, the remaining steps are handled behind the scenes. In our case, the zone provision happens instantly, and you can see the result in your control panel or API. Behind the scenes, the zone will be serving traffic within a minute.
- What gets measured, gets improved. How do you know what parts of the workflow need improvement? Measure it. With New Relic in place, we have graphs of our API performance and can directly see if a server or zone is causing trouble, and the impact of our changes. There’s no comparison between a real-time performance graph and “Strange, the site seems slow, I should tail the logs”.
- Handle failures gracefully. Moving to an asynchronous workflow gives you a chance to re-examine failure scenarios. In our case, the earlier API was overly optimistic about operations like database updates, and might return a successful response when a silent failure had occurred. Additionally, it would send errors like an Nginx reload failure inline, as part of an individual provisioning response. We changed this behavior to send a global alert if the Nginx reload fails, because a failed reload can impact several zones, not just the zone in the request that triggered it.
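The decoupled reload and the global failure alert described above can be sketched as follows. This is a simplified illustration under stated assumptions, not our production worker: `ReloadWorker`, `handleProvisionRequest`, and the injected `reload` and `alert` callbacks are hypothetical names, and the response shape (a 201 with a pending reload flag) is an assumption made for the example.

```javascript
// Sketch of a reload worker that runs independently of the request path.
// `reload` and `alert` are injected, hypothetical callbacks: `reload` stands
// in for reloading Nginx, `alert` for whatever raises a global alert.
class ReloadWorker {
  constructor({ reload, alert }) {
    this.reload = reload;
    this.alert = alert;
    this.dirtyZones = new Set(); // zones modified since the last reload
    this.running = false;
  }

  // Called from the request handler. Returns immediately; never waits on a reload.
  markDirty(zoneName) {
    this.dirtyZones.add(zoneName);
  }

  // Called periodically. One reload covers every zone changed since the last run.
  async runOnce() {
    if (this.running || this.dirtyZones.size === 0) return false;
    this.running = true;
    const batch = [...this.dirtyZones];
    this.dirtyZones.clear();
    try {
      await this.reload(batch);
      return true;
    } catch (err) {
      // A failed reload can affect many zones, so raise a global alert
      // and keep the batch for the next attempt.
      batch.forEach((zone) => this.dirtyZones.add(zone));
      this.alert(`reload failed for ${batch.length} zone(s): ${err.message}`);
      return false;
    } finally {
      this.running = false;
    }
  }
}

// Request handler: validate, record the change, and respond right away.
function handleProvisionRequest(worker, zoneName) {
  if (typeof zoneName !== 'string' || zoneName.length === 0) {
    // Report the failure explicitly instead of returning an optimistic success.
    return { status: 400, body: { error: 'zone name is required' } };
  }
  worker.markDirty(zoneName);
  return { status: 201, body: { zone: zoneName, reload: 'pending' } };
}
```

In use, something like `setInterval(() => worker.runOnce(), 5000)` would drive the worker; the five-second interval is an arbitrary choice for the sketch. Because changes are batched, ten zones provisioned in quick succession cost one reload rather than ten, and a reload failure surfaces once, globally, instead of inside one unlucky response.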