How AdRoll Has Scaled To Process 1.8 Trillion Requests Per Month
March 25, 2015 | Stephen Dionne
MaxCDN is proud to present #MaxScale, our new series that takes an in-depth look at how growing tech companies handle scale at the highest levels of bandwidth, storage, analysis and more. In this installment, we chat with AdRoll’s CTO Valentino Volonghi.
In 2007, a team of four people founded advertising startup AdRoll. Established and headquartered in San Francisco, CA, AdRoll has grown tremendously over the last eight years, setting the standard for retargeting platforms worldwide. The company now employs over 500 people in offices across San Francisco, New York, Sydney, Tokyo and Dublin.
Valentino Volonghi, AdRoll’s CTO
AdRoll collects and analyzes customer data to help deliver high-performance marketing campaigns to advertisers in companies of all sizes. In 2009, AdRoll refocused its efforts on ad retargeting and remains a powerful ad retargeting platform today.
In 2013, AdRoll acquired data analytics startup Bitdeli, a GitHub analytics engine that also started in San Francisco. AdRoll integrated the Bitdeli technology into its platform and brought the Bitdeli talent onto its staff.
Valentino Volonghi is AdRoll’s CTO and a founding member of the company. Among his other duties, Valentino designs and implements the globally distributed, self-healing infrastructures and systems used by AdRoll. He also founded the Italian Python Association and serves as its current president. Valentino is an active contributor in the open source community and specializes in distributed systems.
Stephen Dionne: As AdRoll has expanded, what is the biggest challenge you’ve faced in scaling your IT operations? How did you successfully (or unsuccessfully) face that challenge?
Valentino Volonghi: The biggest problem is finding a way to innovate at the same speed we could when we were a small startup with a handful of people. We’re a team of 80 people now, so it’s hard to reach that same level of speed and agility.
In tackling this problem, you need to ask yourself, “How do you organize your infrastructure and deployment cycles so that everyone in the department can operate independently, without having to worry too much about coordination with each other?” The best solution is to try to make sure that every team is equipped to operate its own tools and infrastructure. Make sure that the infrastructure is stable. Make sure that everyone is using the same deployment cycles and the same tools, but those tools don’t need to be tied to each other. We have met with some success in implementing this solution, but we’re not yet at the level I’d like us to be. There are still a few technical challenges that need to be overcome.
One important tool that helps us reduce coordination costs is the cloud. It’s very easy for one of our teams to requisition a machine for deployment without having to coordinate with the operations team. So the cloud makes things very streamlined for experiments, rollouts, even disaster recovery.
Another challenge is dealing with the large amount of data we collect and use in order to improve our algorithms. In a compressed format, we generate more than a petabyte of data every month. There are other companies that store large amounts of data like we do, such as Dropbox or Pinterest. But those companies don’t typically access all of their data all of the time. In our line of business, everything that we generate is going to be accessed multiple times a day, every day. So we need to be able to scale the speed of our infrastructure to handle the amount of data that we generate. For AdRoll, this is a particularly complicated challenge.
To give you an idea of exactly how much data we’re talking about, consider this comparison. AdRoll generates about 150 terabytes of uncompressed data every day. So across three days, that’s roughly 450 terabytes of data. In three days, we exceed the annual data output of every U.S. stock exchange combined. We’re handling a massive amount of data. It’s taken some work and a few iterations, but our system is now capable of scaling to any amount of data we need to ingest. The constant flux of data is no longer an issue for us, and we can instead think about how to use that data to benefit our customers.
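The volumes quoted above can be sanity-checked with a little arithmetic. This is only a sketch: the 30-day month and decimal units (1 PB = 1,000 TB) are assumptions not stated in the interview, and because the compressed figure is given only as "more than a petabyte", the compression ratio it implies is an upper bound rather than a measured value.

```python
# Assumptions (not from the interview): a 30-day month and 1 PB = 1,000 TB.
UNCOMPRESSED_TB_PER_DAY = 150   # "about 150 terabytes of uncompressed data every day"
COMPRESSED_PB_PER_MONTH = 1     # "more than a petabyte" per month, used as a lower bound
DAYS_PER_MONTH = 30
TB_PER_PB = 1_000


def uncompressed_tb(days: int) -> int:
    """Uncompressed data generated over the given number of days, in TB."""
    return UNCOMPRESSED_TB_PER_DAY * days


three_day_tb = uncompressed_tb(3)
monthly_pb = uncompressed_tb(DAYS_PER_MONTH) / TB_PER_PB
# Compressed output is at least 1 PB, so the ratio can be no higher than this.
max_compression_ratio = monthly_pb / COMPRESSED_PB_PER_MONTH

print(f"3 days:  {three_day_tb} TB uncompressed")
print(f"30 days: {monthly_pb} PB uncompressed")
print(f"Implied compression ratio: at most {max_compression_ratio}:1")
```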
SD: Walk me through the technology stack that AdRoll is built on. How did you choose each major piece?
VV: Each of our teams is sort of independent and has the freedom to choose its own technologies. That being said, for the bulk of our use cases, we have decided to adopt four main languages. We use Python for the web application side, Java for data processing, the D Programming Language for our machine learning infrastructure and Erlang for all of our realtime application needs.
A picture of the San Francisco office. In addition to its office in San Francisco, AdRoll also has offices in New York, Dublin, Sydney, London and Tokyo.
Along with these languages, we also have different types of databases we prefer to use, depending on the specific case. We use Presto for short-term, high-detail data review. For example, we may want to explore the past day or the last week’s worth of data in extreme detail without any kind of summarization or aggregation. For longer-term reviews, we built our own technology solution after acquiring Bitdeli. This database is aggregated and doesn’t offer as much detail, but it can go back multiple years. We also have an HBase database for our general reporting on the user side. It’s mostly run using open-source libraries that we know well and enjoy using.
SD: Tell me a little about the tech providers with whom you have developed long-term relationships. What motivates you to continue these particular relationships?
VV: The biggest vendor we work with is Amazon. We’ve been working together since about 2007. The main reason we use Amazon is scale. In advertising, we need to be everywhere in the world at the speed of light, especially when it comes to bidding. The speed of light in a vacuum is about 300,000 km/s. In optical fiber, light travels at about 200,000 km/s. So it takes a ray of light about 60 milliseconds to go from New York to Paris and back. For us, even that’s too long a time. In advertising, we have to be able to react within 100 milliseconds to any bidding request that comes from anywhere in the world – Asia, Europe, the Americas, Australia, everywhere. It’s very important that our infrastructure spans the entire globe.
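A rough check of the round-trip figure, as a sketch rather than a network model. The New York to Paris distance of about 5,840 km is an assumption (an approximate great-circle distance, not given in the interview), while the 200,000 km/s signal speed and the 100-millisecond deadline are the figures quoted above. Real network paths are longer than the great circle and add switching delay, so the result is a lower bound.

```python
SIGNAL_SPEED_KM_PER_S = 200_000   # signal speed quoted in the interview
NY_PARIS_KM = 5_840               # assumption: approximate great-circle distance
BID_DEADLINE_MS = 100             # response deadline quoted in the interview


def round_trip_ms(distance_km: float,
                  speed_km_per_s: float = SIGNAL_SPEED_KM_PER_S) -> float:
    """Minimum round-trip propagation delay in milliseconds."""
    return 2 * distance_km / speed_km_per_s * 1_000


rtt = round_trip_ms(NY_PARIS_KM)
remaining = BID_DEADLINE_MS - rtt

print(f"Round trip: {rtt:.1f} ms")
print(f"Left for processing the bid: {remaining:.1f} ms of {BID_DEADLINE_MS} ms")
```

Under these assumptions, propagation alone uses more than half of the deadline, which is consistent with the point that the infrastructure has to sit close to wherever the bid requests originate.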
For all this, we need a single API from which we can manage everything. Many companies build this system on their own, but it’s a daunting project for a small startup in a very competitive market. Even before you can start to compete, you’re asked to build four data centers. For a small company with a handful of employees and without a lot of capital, that’s just too much to ask. So we started working with Amazon back in the beginning when we needed a global span right away. Amazon allowed us to easily reach every corner of the world. Our time is better spent adding value to our customers than building a piece of background infrastructure.
We have also worked for a long time with a company called Datadog. Datadog provides monitoring and metrics for our entire infrastructure. We have a distributed, realtime system that runs throughout the world and handles millions of dollars daily. If anything were to go wrong, it would cost a significant amount of money.
To put it another way, we handle 60 billion requests per day. If we make an error 1% of the time, that amounts to 600 million errors. If you consider that a thousand requests cost between $1 and $2, we could be seeing $600,000 to $1.2 million of missed ad impressions just because 1% of the time we had an error. So it’s a significant risk and expenditure, and all of our machines need to have complete control and automated alerts. It’s simply impossible for humans to keep track of everything happening as it happens. Our systems generate more than 100,000 metrics every second, and they get processed in real time by Datadog. Our relationship continues to grow simply because we get more value out of the service than we pay for it.
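The arithmetic behind those numbers, written out as a short sketch. All inputs come from the interview except the 30-day month, which is an assumption used only to relate the daily volume to the monthly figure in the headline.

```python
REQUESTS_PER_DAY = 60_000_000_000   # "60 billion requests per day"
ERROR_RATE_PERCENT = 1              # "an error 1% of the time"
COST_PER_THOUSAND_USD = (1, 2)      # "$1 to $2" per thousand requests
DAYS_PER_MONTH = 30                 # assumption, not stated in the interview

# Integer arithmetic keeps every figure exact.
failed_requests = REQUESTS_PER_DAY * ERROR_RATE_PERCENT // 100
low_usd, high_usd = (failed_requests // 1_000 * cost
                     for cost in COST_PER_THOUSAND_USD)
monthly_requests = REQUESTS_PER_DAY * DAYS_PER_MONTH

print(f"Failed requests per day: {failed_requests:,}")
print(f"Missed impressions per day: ${low_usd:,} to ${high_usd:,}")
print(f"Requests per month: {monthly_requests:,}")
```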
The last tech provider that we use is Tableau Software. We use Tableau’s dashboards for our BI infrastructure, which give our BI analysts the ability to have an incredible level of introspection into the company. Without Tableau, it would be very hard to run AdRoll. There are simply too many moving parts. By having a system that allows our analysts to build their own graphs, dashboards and monitoring, our analysts can quickly get to the data and provide solutions once a problem has been identified.
SD: What prompted your company to focus on ad retargeting, and what technical challenges did you face during that transition?
VV: In 2008, the economy took its big downturn, and we saw that we were going to have to adapt in order to survive in the new marketplace. In response, we kept a strict focus on projects that would add measurable value to our customers. Ad retargeting was still nascent back then, and it was one of several different product experiments that we tried. We found that existing ad retargeting products were not very sophisticated at that point, and we saw a market need for an ad retargeting product whose value was easily measured. So we pounced on that opening. We halted all development on the old AdRoll products and focused entirely on building the full-fledged ad retargeting platform that we have today.
AdRoll workers help turn customer data into high performance marketing.
Fortunately we had a lot of technical assets already. For one thing, our company had always been in the advertising business, so we had very advanced and scalable systems in place. Developing the user experience and the sophisticated algorithms running in the background were two of our chief concerns.
All in all, the biggest technical challenge since the refocus has been sustaining growth. Of course you want to be able to develop other things as well, but this business is going at a thousand miles per hour. It’s tough to balance your focus between adding new products, generating value, and making sure the system isn’t overwhelmed as more and more customers sign up.
SD: Tell us about your acquisition of Bitdeli from an IT standpoint. What challenges did you face integrating Bitdeli into your own platform? What benefits have you seen since the acquisition?
VV: We had a very clear need when we acquired Bitdeli. We needed to be better at inspecting our data and at being able to tell what’s going on in our internal systems at multiple different layers. Back then, the only BI analysis capability we had was something simple that we had built ourselves. But it was not nearly good enough for the level of detail and granularity we wanted.
AdRoll workers at the Mission Street offices in San Francisco.
Since we didn’t have anything big or complicated to begin with, it was fairly easy to get Bitdeli integrated with the way we process our logs and data. It took us about eight months to finish the development of the product behind Bitdeli, which is a database called DeliRoll. After those eight months, we went from having a relatively limited ability to analyze the status of the business to having a complete, extremely detailed view of all the important metrics that drive AdRoll’s daily performance.
So we have experienced a very good return on investment in acquiring Bitdeli, and the Bitdeli team is really happy with us as well. The Bitdeli team continues working on the product and adding new features. Besides working on our new infrastructure, the team also contributes toward external products for our customers.
SD: What has been your biggest technological challenge in scaling to handle billions of ad impressions every month? How have you addressed it?
VV: A few big issues arise when you’re trying to scale. In one sense, scaling ad impressions isn’t that hard. You just add more hardware or optimize the software, and eventually you’re able to meet the new demand. The really challenging component is cost. You don’t want to spend all of your money right away, and you don’t want to spend all of your money on the wrong things. Then when you start to add machines that are in different geographic areas and you have global customers, you also want to be able to spend this money across the world as efficiently as possible. It’s a very complicated allocation problem that is difficult to plan too far in advance.
Another challenge was that we needed to come up with ways to keep our systems fast across those large geographic areas. When you’re integrating with multiple ad exchanges and want to access traffic wherever it’s originating, then this problem is not only tied to the geographic areas, but is also tied to the different networks with which you’re operating.
Finally, when you start to serve impressions that reach into the billions, how do you keep the system up and running? You need to keep all failures contained within the smallest possible space and not let them affect all of the different networks with which you’re integrated. An infrastructure like this needs to be able to run without human supervision. Humans aren’t fast enough to react to problems and issues that develop at this speed, so we needed to automate the software that integrates with the reporting software and all of our other tools.
After a while it can become complicated, but so far our infrastructure is very stable. As the system continues to become more sophisticated, we’re going to need more and more monitoring. Monitoring is a critical component if you want to run at this scale. It might not be the hardest part, but without it, you’re flying blind.
SD: What business or technological challenges emerge in regard to your billing system as your company grows? How have you addressed them?
VV: The problems we encounter in our billing system are certainly not simple ones. A product that we’re building today may handle billing differently than our past or future products, so we need to make sure that it’s flexible. The billing component needs to be able to accommodate all of the different billing and pricing models built on top of our platform.
So far, I would say that the biggest challenge has been integrating with international currencies and international payment gateways. We need to be able to accommodate the most common payment methods in each country. For example, in Europe, direct charge to a bank account is much more common than it is in the US. In the US, people are very comfortable with credit cards. Other methods are more common in other parts of the world. So developing a billing system that is flexible enough to cover all of these methods is challenging.
One of the reasons that supporting different currencies presents a difficulty is that typically a customer gives us a budget that covers a span of weeks or months. And unfortunately, exchange rates can fluctuate quite a lot over the course of even one week or one month.
The euro/USD exchange rate has changed significantly over the course of a couple of months, and that would have been a massive problem if we didn’t adjust and compensate for the change. It’s of utmost importance that all of our systems react as we bid for customers from all over the world.
When we first designed the system, we didn’t realize that the ability to adjust to changing currencies would be so fundamental, but it’s proven to be the case. It’s critical to keep track of all the currencies that we directly support and to make sure that we spend the correct amount, regardless of whether the value of the currency went significantly up or down.
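To make the currency problem concrete, here is a minimal sketch of one way a daily spending cap could follow the exchange rate. It is not AdRoll’s billing logic: the function name `daily_usd_cap`, the even day-by-day pacing, the assumption that bids are placed in USD, and the example rates are all hypothetical and chosen only for illustration.

```python
from decimal import Decimal, ROUND_DOWN


def daily_usd_cap(remaining_budget: Decimal, days_left: int,
                  usd_per_unit: Decimal) -> Decimal:
    """Spread the remaining budget (in the customer's currency) evenly over
    the days left, then convert one day's share to USD at today's rate."""
    if days_left <= 0:
        raise ValueError("days_left must be positive")
    per_day_local = remaining_budget / days_left
    # Round down to whole cents so the cap never overshoots the budget.
    return (per_day_local * usd_per_unit).quantize(Decimal("0.01"),
                                                   rounding=ROUND_DOWN)


# Hypothetical campaign: 3,000 EUR remaining with 30 days left.
# Both rates are illustrative values, not historical quotes.
cap_before = daily_usd_cap(Decimal("3000"), 30, Decimal("1.25"))
cap_after = daily_usd_cap(Decimal("3000"), 30, Decimal("1.10"))

print(f"Daily cap at 1.25 USD/EUR: ${cap_before}")
print(f"Daily cap at 1.10 USD/EUR: ${cap_after}")
```

Recomputing the cap from the remaining budget each day means a rate change only affects money that has not yet been spent, so the customer’s total in their own currency stays fixed. `Decimal` is used instead of floats to avoid rounding drift in monetary amounts.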
Valentino explains how simple solutions can increase reliability and create more time for your customers.
SD: What’s the most surprising thing you’ve learned in scaling AdRoll’s platform?
VV: When we’re building a project for the first time, sometimes we need to implement a simple component quickly. We do it in such a way that at a certain point, this component probably won’t be able to scale. It’s just meant to be a temporary solution. Then a few months later, we discover that this simple component has turned out to be super reliable and works perfectly well.
Our whole platform is very simple – at least as simple as a global, multi-datacenter, realtime bidding platform can get – and it’s certainly very easy for an experienced engineer to hold the entire platform in his head and evaluate it. The platform is laid out very rationally and organically. There isn’t any magic involved.
It’s very funny to see over and over again how we want to build extremely intricate solutions that can adjust to any kind of situation. We want to support many different features and offer a lot of flexibility, but sometimes the simplest solution is the best. It gives you the most and allows you to spend more time on generating value for your customers. That’s probably the most surprising thing I’ve learned as we’ve scaled AdRoll’s platform.
Photography by Nick Cope. All images released under Creative Commons license. Please attribute photos to “Photographer: Nick Cope / Source: MaxCDN” and link to this post.