How we built our Real-time Analytics Platform
June 18, 2014 | Bryan Conklin
With the recent release of our Analytics Platform, we would like to give you a behind-the-scenes look at how we built it.
The amount of log data our worldwide network produces is staggering. Every day, our servers generate over 8 terabytes of raw log data before compression, which works out to over 90 MB of raw logs per second on average. At peak times, our servers handle over 200,000 requests per second, and as we grow, that number will only climb. It was imperative that we build a solution that could scale quickly with demand. With that said, here are the project requirements we had to fulfill:
- Handle Current And Future Insert Volume
- Handle Unstructured Request Data
- Minimal Hardware with Redundancy And Scale
- Minimal Software Layers
- Immediate Query Capabilities For Customers
We wanted to rebuild our reporting and analytics platform to provide more informative metrics that help our customers make smart business decisions about their online content. By building a stronger platform that gathers more metrics, we can now deliver richer reporting through our control panel and give our customers the power to retrieve this information through our API for their own needs. Our initial release features a rich API customers can use to retrieve the raw access log data we collect for their content: what content was requested, who requested it, and how we handled the request. For customers who are not developers, we have also released the Log Viewer, which provides the same level of visibility into your access logs through a friendly graphical interface.
Because of our unique requirements, we could not simply pick something off the shelf and plug it in. Our Engineering team developed a new reporting stack that combines a number of tools. The foundation of our platform is TokuMX, a specialized version of the popular MongoDB NoSQL database. TokuMX gives us the flexibility of MongoDB along with faster inserts and on-disk compression. We worked closely with Tokutek (the creators of TokuMX) to ensure that the platform we built could handle the volume of incoming data while retaining acceptable query speed.
In order to insert data into TokuMX as quickly as possible, we wrote custom agents in Go. Go lets us write blazingly fast applications that take full advantage of multi-core CPUs without the traditional hassles of parallel programming in languages such as C++ or Java.
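As a rough sketch of that pattern (our actual agent code is not shown here, and the two-field "path bytes" log format below is hypothetical), a Go worker pool that fans log lines out to one goroutine per CPU core looks like:

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
	"sync"
)

// countBytes tallies the bytes-served field from raw log lines using one
// worker goroutine per CPU core. The space-separated "path bytes" line
// format here is illustrative, not our production log format.
func countBytes(lines []string) int {
	in := make(chan string, len(lines))
	for _, l := range lines {
		in <- l
	}
	close(in)

	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for line := range in {
				fields := strings.Fields(line)
				if len(fields) != 2 {
					continue // skip malformed lines
				}
				n, _ := strconv.Atoi(fields[1])
				sum += n
			}
			mu.Lock()
			total += sum
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	lines := []string{"/a.jpg 512", "/b.css 1024", "/c.js 256"}
	fmt.Println(countBytes(lines)) // 1792
}
```

Channels give each worker its own stream of lines, so the only shared state is the final total behind a mutex.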
The bridge between our CDN and our database cluster is a high-volume message queue built on Redis, which is lightweight enough to keep pace with the incoming data. We provide redundancy in the messaging layer with a set of redundant servers, so no single point of failure can take down the queue.
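Conceptually, each event is pushed onto a Redis list and later popped by a worker. As an illustration of what actually crosses the wire (a sketch, not our agent code; the queue name "logs" is made up), here is how a Redis command such as LPUSH is encoded in Redis's RESP protocol:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeRESP serializes a Redis command as a RESP array of bulk strings,
// which is exactly the byte sequence a Redis client writes on the wire:
// "*<argc>\r\n" followed by "$<len>\r\n<arg>\r\n" for each argument.
func encodeRESP(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

func main() {
	// Push one hypothetical log event onto a list named "logs".
	fmt.Printf("%q\n", encodeRESP("LPUSH", "logs", "GET /a.jpg 200"))
}
```

The simplicity of this framing is a big part of why Redis can keep up: serializing an event costs little more than a few length-prefixed writes.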
On the API side, we needed a platform that provides maximum concurrency for incoming reporting requests and is easy for our entire engineering team to maintain. We chose to write our API layer in Node.js using the popular Express framework. With Node.js and Express, we have built a foundation for our API layer that makes it easy to add functionality, such as highly granular queries with regex matching, and that scales to meet high demand.
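The production API layer is Node.js/Express as described above; to keep this post's code samples in one language, here is a Go sketch of the kind of regex-based filtering those granular queries perform (the function name and paths are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// filterPaths returns the log paths matching a caller-supplied regular
// expression, the same style of granular matching our API's query
// parameters expose over the stored access logs.
func filterPaths(paths []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err // invalid user-supplied pattern
	}
	var out []string
	for _, p := range paths {
		if re.MatchString(p) {
			out = append(out, p)
		}
	}
	return out, nil
}

func main() {
	paths := []string{"/img/a.png", "/css/site.css", "/img/b.jpg"}
	matched, _ := filterPaths(paths, `^/img/.*\.(png|jpg)$`)
	fmt.Println(matched) // [/img/a.png /img/b.jpg]
}
```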
In order to ensure maximum redundancy and flexibility, our reporting and analytics cluster consists of 20 shards in our Dallas datacenter, each replicating to a failover set. Replication gives us high-speed data access as well as redundancy in case of primary server failure. By creating clusters with designated responsibilities (database storage, worker cluster, API cluster), we can scale each out independently as needed. All of our database servers use RAID 10 arrays of SSDs for storage, and because TokuMX is extremely memory-driven, each database server has 128 GB of RAM.
Our CDN is constantly serving requests, and with our new reporting and analytics platform we have strived for real-time reporting. As requests come into our edge, they are immediately shuttled into our incoming message queue for processing. Every request is logged within seconds and kept for five days. The best part is that we store this data for all of our customers, from small business owners to enterprise accounts. In the future, we will start providing real-time aggregated statistics such as bandwidth consumed, number of requests, and more.
By leveraging MongoDB’s scalable design, our database storage layer scales horizontally in step with our growth. In recent tests, we have shown that even our existing infrastructure can consume over 600,000 events per second in real time.
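Horizontal scaling works because each event can be routed to one of many shards. In production, MongoDB's own sharding handles chunk placement; the Go sketch below only illustrates the underlying idea of spreading keys evenly across our 20 shards with a hash (the customer-name shard key is hypothetical):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks one of n shards by hashing a key with FNV-1a. This is
// an illustration of hash-based routing, not MongoDB's actual balancer.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	for _, customer := range []string{"acme", "globex", "initech"} {
		fmt.Printf("%s -> shard %d\n", customer, shardFor(customer, 20))
	}
}
```

Because the hash is deterministic, any worker can route an event without coordinating with the others, which is what lets ingestion scale linearly as shards are added.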
The biggest challenge was finding a way to ingest at least 200,000 requests per second and store them in real time. By building scalable worker models written in Go, we take full advantage of multi-core CPUs with 64 GB of memory to rapidly parse incoming metrics and shuttle them off to TokuMX/MongoDB for storage. When we originally designed the system, we kept the goal of 200,000 requests per second in mind; by building a solution that can scale even further, we have ensured a platform that will grow with us.
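A minimal sketch of that parsing step, assuming a simplified tab-separated line format (production log lines carry many more fields, and these names are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Event is a parsed access-log record. The tab-separated
// "zone\tpath\tstatus\tbytes" layout is hypothetical.
type Event struct {
	Zone   string
	Path   string
	Status int
	Bytes  int64
}

// parseLine converts one raw log line into a typed Event, rejecting
// malformed input so bad lines never reach the database.
func parseLine(line string) (Event, error) {
	f := strings.Split(line, "\t")
	if len(f) != 4 {
		return Event{}, fmt.Errorf("want 4 fields, got %d", len(f))
	}
	status, err := strconv.Atoi(f[2])
	if err != nil {
		return Event{}, err
	}
	bytes, err := strconv.ParseInt(f[3], 10, 64)
	if err != nil {
		return Event{}, err
	}
	return Event{Zone: f[0], Path: f[1], Status: status, Bytes: bytes}, nil
}

func main() {
	ev, err := parseLine("zone42\t/img/logo.png\t200\t5120")
	fmt.Println(ev, err)
}
```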
TokuMX has given us the ability to ingest hundreds of thousands of records per second and store them in MongoDB while delivering the 80% compression required to not break the bank on SSD storage; at that ratio, 8 TB of raw logs per day becomes roughly 1.6 TB on disk. This could have been done with vanilla MongoDB; however, TokuMX lets us do it with less computing hardware and storage, which was part of the requirements.
It was imperative that we start with a single-server environment to establish baseline performance numbers. In single-server testing, we processed up to 75,000 events per second with no issue. From there we knew that by scaling out horizontally (three such servers would already cover more than 200,000 events per second) we would reach our magic metric and be able to go even further.
We looked into various “Big Data” service providers as well as off-the-shelf solutions such as Splunk and Logstash. However, we recognized that we know our data, and how to manage it, best. By building a custom solution on top of fantastic open source software, we created a platform that is easy to scale and empowers our customers with extremely rich metrics.
Want to engineer things like this? Cool, we are hiring.