When it comes to serving reputation, even a millisecond of latency can wreak havoc, allowing malware to spread and causing costly consequences that no security company or customer wants. That's why we, as engineers here at Carbon Black, are constantly working to deliver the highest-performing services possible.
Our reputation services are among Carbon Black's key differentiators, serving as the baseline for our threat intelligence and supporting the more than 5,600 customers and 500 partners that use our CB Defense product. They also support most of Carbon Black's other offerings, including CB ThreatHunter.
Maintaining low latency while being able to scale beyond expected customer growth has been quite an engineering challenge, especially since all of this was done while reducing overall costs!
It’s not unusual for our reputation services to exceed 1.6 billion requests a day (with sustained peaks surpassing 40,000 requests per second) from nearly 3 million CB Defense sensors on our customers’ endpoints. Each of these sensors generates an average of 4,500 events on any given day. (Note: these figures cover only requests that get past our caching layer, which is the primary focus of this blog; they don’t include the tens of billions of daily requests served by our caches.)
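A quick back-of-the-envelope check puts those figures in perspective. The sketch below (our own arithmetic, not from the post) assumes a uniform 86,400-second day and that "requests" means post-cache backend lookups:

```python
# Back-of-the-envelope check of the traffic figures above.
# Assumption (not from the post): load is averaged over a uniform day.

SECONDS_PER_DAY = 86_400

daily_requests = 1_600_000_000   # requests reaching the backend per day
peak_rps = 40_000                # sustained peak requests per second
sensors = 3_000_000              # CB Defense sensors
events_per_sensor = 4_500        # average events per sensor per day

avg_rps = daily_requests / SECONDS_PER_DAY          # ~18,500 req/s average
peak_to_avg = peak_rps / avg_rps                    # peaks run ~2x average
daily_events = sensors * events_per_sensor          # 13.5 billion events/day

print(f"average backend load: {avg_rps:,.0f} req/s")
print(f"peak-to-average ratio: {peak_to_avg:.1f}x")
print(f"total sensor events/day: {daily_events:,}")
```

The roughly 2x gap between sustained peak and daily average is one reason capacity planning here is tricky: provisioning for the average leaves peaks underserved, while provisioning for the peak drives up cost.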
Only three years ago, we handled a mere 100 reputation requests per second. Our customer base has grown significantly since then, and we have evolved our system to the point where it can handle up to 100,000 requests per second at a near real-time latency of 13 milliseconds round-trip, backend-to-backend.
Attaining those numbers required a substantial engineering effort, especially since traffic tended to double with each wave of growth. Care also had to be taken with even auxiliary operations such as logging and metrics: every action on the request path added latency, which in turn affected our scalability and cost.
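One common way to keep auxiliary work like metrics and logging off the hot request path is to hand measurements to a background thread through a bounded queue, shedding them under pressure instead of blocking. This is an illustrative sketch, not Carbon Black's actual implementation; the `AsyncMetrics` name and design are our own:

```python
import queue
import threading
import time

# Hypothetical sketch: off-load metrics emission so the hot request
# path never blocks on I/O. Names and design are illustrative only.

class AsyncMetrics:
    def __init__(self, maxsize=10_000):
        # Bounded queue: under pressure we drop metrics rather than
        # add latency to the request path.
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, name, value):
        """Called on the hot path; never blocks."""
        try:
            self._q.put_nowait((name, value, time.time()))
        except queue.Full:
            self.dropped += 1  # shed load instead of stalling requests

    def _drain(self):
        while True:
            name, value, ts = self._q.get()
            # A real system would batch these and ship them to a
            # metrics backend; here we simply discard after dequeuing.
            self._q.task_done()

metrics = AsyncMetrics()
metrics.record("reputation.lookup.ms", 13)
```

The key trade-off is deliberate: when the queue is full, a metric is cheaper to lose than a few milliseconds of request latency.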
Balancing these issues has been a challenge for our team, but with each iteration, we found a solution that worked – until it didn’t anymore. Then we found another solution, and another, until we ended up where we are today – at an acceptable latency, but with more work to do.
In Part 2 of this blog, which will be published shortly, we’ll give you an inside perspective of how we scaled our services while keeping the latency and costs low.
Meanwhile, if you are a security engineer and are intrigued by what we’re doing, take a look at the openings we have here at Carbon Black.
The post Taking Reputation to Scale: The Delicate Balance of Latency, Scale, and Cost (Part 1) appeared first on VMware Carbon Black.