Taking Reputation to Scale: An Iterative Journey with an Agile Approach (Part 2)

2019-11-20T18:00:55
By Sawyer Lemay

In Part 1 of this blog, we shared with you the challenges we had in balancing latency, scalability, and cost for our reputation services. In this blog, we’ll give you some insights into each major iteration along that journey, from the beginning to where we are now.

100 requests per second. Before we acquired Confer and evolved CB Defense into what it is today, our reputation services resided in our own data center and were backed by a large relational database. Servers for our CB Protection and CB Response products, residing in our customers’ environments, would sync with our database to get reputation. The sync process was not in real time and therefore had less stringent constraints than CB Defense does today.

With the acquisition and the planned growth of CB Defense, however, we found that we had to quickly swap out Confer's reputation implementation for our own. This was done in phases, the first of which consisted of replacing a core component with one of our own services. At that point, the CB Defense backend was making direct requests to our data center at a sustained peak rate of 100 requests per second.

From 100 to 1,000 requests per second. That architecture worked for a while, but around this time the reputation request rate grew 10x, from 100 to 1,000 requests per second, within a month. Not only did this substantially increase the load on our data center, but the request latency from Amazon Web Services (AWS) to our data center and back was too high.

CB Defense had a pre-existing callback model that we leveraged to help solve this problem. In that model, the backend had only 3 seconds to resolve an item's state from unknown to known. Working within that model, we built out a set of Python web services in AWS so that the data center request could be performed asynchronously. This also allowed us to fully swap out the previous reputation implementation (Confer's) for our own, built on Amazon DynamoDB, Amazon Simple Queue Service (SQS), and Amazon Elastic Compute Cloud (EC2). DynamoDB served as a cache and was used to maintain state. SQS delegated the responsibility of updating DynamoDB with the reputation from our data center. A Kubernetes cluster containing the SQS queue processors ran on EC2.
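
To make the pipeline concrete, here is a minimal sketch of what one of those SQS queue processors could look like, using boto3. The queue URL, table name, and `lookup_datacenter_reputation` helper are illustrative placeholders, not the actual names used in the service:

```python
import json

import boto3

# Illustrative resource names; the real queue and table names are not part of this post.
sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
cache_table = dynamodb.Table("reputation-cache")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/reputation-lookups"


def lookup_datacenter_reputation(sha256):
    """Placeholder for the synchronous reputation request to the data center."""
    raise NotImplementedError


def process_queue_forever():
    while True:
        # Long-poll SQS so idle workers do not spin.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in resp.get("Messages", []):
            item = json.loads(message["Body"])
            reputation = lookup_datacenter_reputation(item["sha256"])
            # Move the cached entry from "unknown" to a known reputation state.
            cache_table.put_item(
                Item={"sha256": item["sha256"], "reputation": reputation, "state": "known"}
            )
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
            )
```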

From 1,000 to 6,000 requests per second. With this increase in scale, the burden on our data center became apparent. To alleviate it, we implemented batching, where items (across requests) were chunked into a single request to our data center. While doing this, it was important that we could still resolve the above-mentioned state within 2 seconds. To achieve this, we switched to a serverless model, replacing SQS and our EC2 workers with Amazon Kinesis and AWS Lambda, respectively. There was an immediate and significant cost reduction, with batch sizes growing to a maximum of 100 items. The frontend still received one item at a time, but we were able to buffer the items with Amazon Kinesis, enabling batch processing.
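
As a rough illustration of that batching step, a Lambda consumer receives up to the configured batch of Kinesis records per invocation and can resolve them with a single chunked request to the data center. The table name and `batch_lookup_datacenter` helper below are assumptions for the sketch:

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
cache_table = dynamodb.Table("reputation-cache")  # illustrative table name


def batch_lookup_datacenter(hashes):
    """Placeholder: one chunked reputation request to the data center for a batch of hashes."""
    raise NotImplementedError


def handler(event, context):
    # Kinesis hands the Lambda up to the configured batch size (e.g., 100 records).
    hashes = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        hashes.append(payload["sha256"])

    # One round trip to the data center resolves the whole batch.
    for sha256, reputation in batch_lookup_datacenter(hashes).items():
        cache_table.put_item(
            Item={"sha256": sha256, "reputation": reputation, "state": "known"}
        )
```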

From 6,000 to 20,000 requests per second. While this was a major improvement, we soon ran into additional issues. The latency was once again too high. We found that it took about 2 to 3 milliseconds to put data into our Amazon Kinesis stream. To address that, we modified our code to send data to a local background queue, and we took the same approach for our logging. We also optimized the authentication process between the backend services, enabling us to cache the credentials for our own services. The combined impact of those improvements resulted in a savings of 15 milliseconds.
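
The background-queue idea is straightforward: the request path only pays for a local, in-process enqueue, and a separate thread performs the 2 to 3 millisecond Kinesis put. A minimal sketch, with an illustrative stream name:

```python
import json
import queue
import threading

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "reputation-requests"  # illustrative stream name
_pending = queue.Queue()


def enqueue_reputation_request(item):
    """Called on the request path: costs a local queue put, not a network round trip."""
    _pending.put(item)


def _drain_forever():
    while True:
        item = _pending.get()
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(item).encode("utf-8"),
            PartitionKey=item["sha256"],
        )


# One daemon thread drains the queue in the background.
threading.Thread(target=_drain_forever, daemon=True).start()
```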

At 20,000 requests per second, we started to see some additional costs with Amazon Kinesis. We ended up implementing something similar to the Kinesis Producer Library (KPL). This gave us the control we needed and allowed us to pre-aggregate our requests before sending them to Amazon Kinesis. By aggregating the records for only a couple of milliseconds, we were able to save thousands of dollars per month.
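
The KPL-style producer buffers records for a few milliseconds and sends them together in a single PutRecords call (the full KPL goes further and packs multiple user records into one Kinesis record). The buffer limit and flush interval below are illustrative, not the production values:

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "reputation-requests"  # illustrative
MAX_BATCH = 500                      # PutRecords accepts at most 500 records per call
FLUSH_AFTER_MS = 2                   # hold records for only a couple of milliseconds

_buffer = []
_last_flush = time.monotonic()


def add_record(item):
    """Buffer a record; flush when the buffer is full or a couple of milliseconds old."""
    global _last_flush
    _buffer.append({
        "Data": json.dumps(item).encode("utf-8"),
        "PartitionKey": item["sha256"],
    })
    age_ms = (time.monotonic() - _last_flush) * 1000
    if len(_buffer) >= MAX_BATCH or age_ms >= FLUSH_AFTER_MS:
        kinesis.put_records(StreamName=STREAM_NAME, Records=list(_buffer))
        _buffer.clear()
        _last_flush = time.monotonic()
    # A production producer would also flush on a timer and guard the buffer with a lock.
```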

Latency too slow at 30 milliseconds. At this point, our request rate was still growing, and while our latency was down to 30 milliseconds, it was still too slow for us. We decided to reduce network latency and improve connectivity between our reputation services and the CB Defense backend by leveraging AWS PrivateLink, which allowed all of our network traffic to stay within AWS. This change enabled us to remove our gateway, with CB Defense able to hit our load balancer directly. That shaved off latency, bringing it down to about 20 milliseconds. It also reduced our network costs by thousands of dollars and our EC2 costs by another substantial chunk.
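
Conceptually, PrivateLink exposes the reputation service as an endpoint service behind a Network Load Balancer, and the consuming VPC reaches it through an interface endpoint so traffic never leaves the AWS network. A hedged sketch of the two API calls involved; all IDs and ARNs are placeholders, and account permissions and endpoint acceptance are omitted:

```python
import boto3

ec2 = boto3.client("ec2")

# Provider side: publish the reputation service's Network Load Balancer as an endpoint service.
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/reputation-nlb/0123456789abcdef"
    ],
    AcceptanceRequired=False,
)

# Consumer side (typically a different account/VPC): create an interface endpoint to that service.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName=service["ServiceConfiguration"]["ServiceName"],
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```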

From 20,000 to 40,000 to 90,000 requests per second. To stay ahead of our growth and reduce our costs, we decided to make a major architectural change: completely remove the call to our data center. The motivation was multifaceted: we were expanding our geographic presence, we wanted to further reduce costs, we wanted to eliminate the callback (and thereby dramatically reduce the request rate), and our data center was struggling to sustain the rate.

An added complication with Amazon Kinesis was that it took about 700 milliseconds before an item was retrievable, leaving about 1.3 seconds to make a request to our data center, get an answer, and update our cache. When we were based in only one geography, this was sufficient; however, once the request had to travel around the world, we found that the latency was too high, and sometimes the request couldn’t be fulfilled in time.

We addressed this problem by building an internal reputation feed. Sticking with serverless technologies, we leveraged Amazon Simple Notification Service (SNS), Amazon Simple Storage Service (S3), and AWS Lambda. We continued to use DynamoDB, leveraging it as our database – instead of as a cache – to store our intelligence on over 2 billion files.
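
A minimal sketch of what a consumer of such a feed could look like: SNS notifies a Lambda that a new S3 object of reputation records has landed, and the Lambda batch-writes it into DynamoDB. The bucket layout, table name, and newline-delimited JSON format are assumptions for illustration:

```python
import json

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
reputation_table = dynamodb.Table("reputation")  # now the source of truth, not a cache


def handler(event, context):
    # Each SNS message wraps an S3 event notification pointing at a new feed object.
    for sns_record in event["Records"]:
        notification = json.loads(sns_record["Sns"]["Message"])
        for s3_record in notification["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

            # Batch-write the feed contents into DynamoDB.
            with reputation_table.batch_writer() as batch:
                for line in body.splitlines():
                    record = json.loads(line)
                    batch.put_item(
                        Item={"sha256": record["sha256"], "reputation": record["reputation"]}
                    )
```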

Shortly after this implementation, the request rate unexpectedly grew to 40,000 per second over the course of a week, then to 50,000 the next week, and peaked soon after at 90,000 requests per second. Counting backend requests across our entire environment, we were exceeding 300,000 per second at that point.

From there to today. This required another major architectural change and brought us to where we are today. Our request rate is currently around 40,000 per second on a normal business day, with latency down to 13 milliseconds, while sustaining a low cost per request. All of this supports the more than 600,000 reputation requests per second observed across our products.

In future iterations, we will once again work toward lower latency at a larger scale while improving reliability and achieving greater visibility.

If you are a security engineer and want to join us as we continue to develop our services here at Carbon Black, take a look at our openings.
