Taking Reputation to Scale: The Delicate Balance of Latency, Scale, and Cost (Part 1)

2019-11-06T18:00:14
ID CARBONBLACK:670F50B61E16EFEBF207BEBFD3F10AC1
Type carbonblack
Reporter Sawyer Lemay
Modified 2019-11-06T18:00:14

Description

When it comes to serving reputation, even a millisecond of added latency can create havoc, allowing malware to spread and causing costly consequences that no security company or customer wants. That’s why we, as engineers here at Carbon Black, are constantly working toward the highest-performing services possible.

Our reputation services are one of Carbon Black’s key differentiators, serving as the baseline for our threat intelligence, and supporting over 5,600 customers and 500 partners that use our CB Defense product. They also support most of Carbon Black’s offerings, including CB ThreatHunter.

Maintaining low latency while being able to scale beyond expected customer growth has been quite an engineering challenge, especially since all of this was done while reducing overall costs!

Balancing three critical factors: Latency, scale, and costs

It’s not unusual for our reputation services to exceed 1.6 billion requests a day (with sustained peaks surpassing 40,000 requests per second) from nearly 3 million CB Defense sensors on our customers’ endpoints. Each of these sensors generates an average of 4,500 events per day. (Note: These numbers are only for requests outside of our caching layer, which is the primary focus of this blog; as such, they don’t include the tens of billions of daily requests served by our caches.)
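As a back-of-the-envelope check, the figures above can be turned into an average request rate and a peak-to-average ratio. The sketch below uses only the numbers quoted in this post; the variable names are illustrative, and "nearly 3 million" sensors is rounded up to 3 million, so the event total is an upper estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the post (uncached traffic only).
requests_per_day = 1_600_000_000   # reputation requests per day
peak_rps = 40_000                  # sustained peak, requests per second
sensors = 3_000_000                # assumption: "nearly 3 million" rounded up
events_per_sensor_per_day = 4_500  # average events per sensor per day


def average_rps(per_day: int) -> float:
    """Average per-second rate implied by a daily total."""
    return per_day / SECONDS_PER_DAY


avg_rps = average_rps(requests_per_day)
peak_to_average = peak_rps / avg_rps
events_per_day = sensors * events_per_sensor_per_day

print(f"average rate:    {avg_rps:,.0f} requests/s")   # 18,519
print(f"peak-to-average: {peak_to_average:.2f}x")       # 2.16
print(f"sensor events:   {events_per_day:,} per day")   # 13,500,000,000
```

In other words, a sustained peak of 40,000 requests per second is a little over twice the average rate, so capacity has to be planned around the peak rather than the daily total.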

Only three years ago, we were serving a mere 100 reputation requests per second. Since then, our customer base has grown significantly, and we have evolved our system to the point where it can handle up to 100,000 requests per second with a near-real-time latency of 13 milliseconds round-trip, backend-to-backend.
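To put that growth in perspective, here is a rough sketch of what a jump from 100 to 100,000 requests per second implies. It makes two simplifying assumptions that are not claims from the post: it compares the old load with today's capacity as if both were the same kind of number, and it treats 13 milliseconds as the mean time a request spends in the backend at full load. The concurrency estimate uses Little's Law (mean requests in flight = arrival rate × mean time in system).

```python
import math

start_rps = 100          # request rate three years ago (from the post)
capacity_rps = 100_000   # capacity quoted in the post
years = 3
latency_s = 0.013        # 13 ms backend-to-backend (assumed to be the mean)

# How many times the rate had to double, and how often on average.
growth_factor = capacity_rps / start_rps
doublings = math.log2(growth_factor)
months_per_doubling = years * 12 / doublings

# Little's Law: L = lambda * W.
in_flight = capacity_rps * latency_s

print(f"growth factor:       {growth_factor:,.0f}x")
print(f"doublings:           {doublings:.1f}")
print(f"months per doubling: {months_per_doubling:.1f}")
print(f"requests in flight:  {in_flight:,.0f}")
```

Under these assumptions, a 1,000x increase is roughly ten doublings in three years, or one doubling every three to four months, with on the order of 1,300 requests in flight at full load.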

Attaining those numbers required a substantial engineering effort, especially since each jump in scale tended to double the previous numbers. We also had to take care with even auxiliary operations such as logging and metrics, because every action added latency, which in turn affected our scalability and cost. Here’s a deeper look at what we faced with each of those issues:

  • Latency: As mentioned above, latency is one of the most critical factors in the delivery of reputation. To successfully protect our customers from the threat of malware, we must keep our latency, both for response times and for state resolution, as low as possible. This means our service has to be fast enough to keep up with our pipeline. In other words, we cannot afford to delay the time between an event and an alert. For state resolution (the process of going from a reputation of unknown to known), our goal is to have a final verdict ready within 2 seconds. For response time, we want the round-trip time from a customer’s endpoint to our backend to be as close to 0 as possible. This requires us to maintain an internal (backend-to-backend) latency of no more than 15 milliseconds. We are currently at an acceptable latency of 13 milliseconds, but we know we can do better.
  • Scalability: As our customer base and offerings continue to grow, we have to keep finding ways to scale our services to support more requests. This can be particularly challenging when a significant increase happens in a very short time. For instance, we once had a 10x increase in reputation requests within a single month. At the time, our data center performed the bulk of the reputation computation. As you can imagine, that required some pretty fast thinking on our part to keep up with the requests while reducing the strain on our data center and without compromising on latency.
  • Cost: As we scaled our resources up to meet greater demand, our costs increased, sometimes to the point where the changes we wanted to make would be cost-prohibitive. In those cases, we had to evolve our architecture while maintaining a balance between performance, cost, and implementation time. All of this had to be done without any downtime.
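The latency targets described above can be written down as explicit budgets and checked against an observed value. This is only an illustrative sketch: the `LatencyBudget` class and its method names are hypothetical and are not part of any Carbon Black service. The limits (15 milliseconds backend-to-backend, 2 seconds for state resolution) and the observed 13 milliseconds are the figures quoted in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """A named upper bound on latency, in milliseconds (hypothetical helper)."""
    name: str
    limit_ms: float

    def within_budget(self, observed_ms: float) -> bool:
        """True if the observed latency does not exceed the limit."""
        return observed_ms <= self.limit_ms

    def headroom_ms(self, observed_ms: float) -> float:
        """Remaining budget; negative when the limit is exceeded."""
        return self.limit_ms - observed_ms


# Targets taken from the post.
BACKEND_TO_BACKEND = LatencyBudget("backend-to-backend response", 15.0)
STATE_RESOLUTION = LatencyBudget("state resolution (unknown to known)", 2000.0)

observed_ms = 13.0  # current backend-to-backend latency quoted in the post

ok = BACKEND_TO_BACKEND.within_budget(observed_ms)
headroom = BACKEND_TO_BACKEND.headroom_ms(observed_ms)
print(f"{BACKEND_TO_BACKEND.name}: ok={ok}, headroom={headroom:.1f} ms")
```

With 13 milliseconds observed against a 15-millisecond limit, only 2 milliseconds of headroom remain, which is why even auxiliary work such as logging and metrics has to be accounted for.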

Balancing these issues has been a challenge for our team, but with each iteration, we found a solution that worked, until it didn’t anymore. Then we found another solution, and another, until we ended up where we are today: at an acceptable latency, but with more work to do.

In Part 2 of this blog, which will be published shortly, we’ll give you an inside look at how we scaled our services while keeping latency and costs low.

Meanwhile, if you are a security engineer and are intrigued by what we’re doing, take a look at the openings we have here at Carbon Black.

The post Taking Reputation to Scale: The Delicate Balance of Latency, Scale, and Cost (Part 1) appeared first on VMware Carbon Black.