In the previous blog posts in this series, we discussed the motivation for clustering attacks and the data used and how to calculate the distance between two attacks using different methods on each feature we extracted. In this final blog post, we’ll discuss the clustering algorithm itself – how to use the distance we calculated to create clusters from the data. We will discuss clustering in real time when only a small amount of data can be stored in memory. Finally, we’ll show some results of the algorithm based on real data from Imperva customers.
Now we have all the basic ingredients to input into the algorithm. What’s left to decide is which clustering algorithm to use. There are many algorithms to choose from that meet varying needs, for example, we’ve previously written about clustering techniques used in Imperva CounterBreach.
Here’s where the algorithm reality punched us right in the face: the demand from our engineering team was that the clustering is done in****real time. Meaning each time a new event enters the system the algorithm needs to decide on the spot how to cluster it and update the current clustering state. This had been done with minimum memory, which meant that individual events could not be stored in memory.
The more popular and well-known clustering algorithms work on a batch of data instead of a stream, i.e., their input is a static dataset. So, this real-time requirement meant we had to look for other algorithms that work in streaming mode.
There are a couple of methods to use to cluster a stream of data. We won’t discuss these methods as they are more complex and technical, instead, we’ll present the requirements of our algorithm and what was needed for them to be met.
First, a clustering algorithm in streaming mode needs to make decisions in real time, meaning that the algorithm maintains in memory a current state of the clusters and each time a new event enters the system the algorithm updates the clustering state. This is done instantaneously and without storing the discrete event in memory.
Second, we need to remember that each time the algorithm was making a decision it was doing so based on partial data. That’s because the algorithm only processed past data. If the algorithm were to know allof the events (past and future events) the decision it would make might be different. So, the algorithm must have a way to undo decisions it made in the past. The way the algorithm undoes its decisions is by splitting a cluster into smaller parts and merging the parts together into other clusters that are the best fit.
Finally, most of the streaming clustering algorithms from academic articles work on spatial data. This means that their input are points in a Euclidean space (think of it as coordinates in an n-dimensional space). Our data is more complex, it contains URLs which are strings, IPs, geographic coordinates and other varying features. These features cannot be easily embedded into a Euclidean space, and even if they would it would make no sense to do so. So, the algorithm we needed must assume only that we can calculate the distance between two data points, and not that they are embedded into a Euclidean space.
We used a homegrown algorithm to answer these needs. Clustering in streaming mode is always a trade-off between accuracy of the results and the time and memory efficiency. We tried to find the balance so the result would be as accurate as possible storing only the minimum amount of data needed in memory while performing the least possible amount of calculations with each incoming event. See Figure 7 for the general flow of clustering in streaming mode.
Figure 7: Clustering in streaming mode - clusters may change due to new events entering the system
We stored aggregated structured data in memory instead of raw events; this way we were able to split clusters, to some extent, and rearrange them as would seem most appropriate. Also, in order to process data in real time, most of the time we used a light-weight distance function that wouldn’t take too much time to calculate and didn’t consider all the features. We used a heavier and more accurate distance function that considered all the features only at predefined times when there were enough new events that entered the system, as we expected the clustering state might change significantly.
Also, for performance considerations, we couldn’t cluster all the events from the beginning each time a new event entered the system. That’s why every time a new event came in the algorithm used its current clustering state to do calculations only on the clusters that may change due to the new event. This way we significantly reduced the time it took to process each new event.
For validation of the algorithm, some of our web application firewall (WAF) customers provided us with logs containing events from their WAF. Here are three highlighted clusters which contain incidents we thought were interesting:
CVE-2017-7529 is a vulnerability of Nginx that allows an attacker to launch an integer overflow attack using a crafted “range” header. We saw a cluster on a customer’s WAF containing over two thousand attacks from over 100 distinct IPs over a period of three days trying to exploit this vulnerability. Over 80% of the attacks came from the US and most of the attacks seemed to use the same attack tool. Also, the attack targeted many different URLs, although it targeted only two resource extensions: PDF and CFM.
Email collector robots try to scrape web applications to find email addresses. The purpose for email harvesting is mostly to collect lists of emails in order to sell them to spammers. We saw a cluster on a customer’s WAF which contained over 50 distinct IPs that performed email harvesting. The source of these attacks was very distributed, from the US, Europe, South America and Asia. Most of the targets were the home page of the application. This means that after the robots were blocked at the home page they didn’t proceed to scrape the rest of the site, probably moving on to try other websites which are not protected by a WAF. The same cluster was also found in more than five different web applications we analyzed indicating this is a popular attack.
In previous blog posts we discussed Apache Struts vulnerabilities, and how they are very popular among attackers, especially ones from Asian countries. CVE-2017-5638 is an Apache Struts vulnerability published on March 2017 that allows attackers to launch remote code execution attacks using a crafted “content-type” header. We saw a cluster of attacks trying to utilize this vulnerability; most of the attacks came from China and the target was very distributed, containing multiple URLs. Also, in addition to this specific vulnerability, the attackers tried to utilize other vulnerabilities of Apache Struts. This is a popular phenomenon we see in our data: attackers trying to utilize different vulnerabilities of the same system, in this case Apache Struts. The cluster appeared on over ten different web applications we analyzed, and all the clusters contained similar attributes. This indicates the popularity of Apache Struts vulnerabilities among attackers.
Clustering application attacks is a challenging task that requires a lot of research and experimentation. Throughout the process, we encountered many difficulties and made a number of decisions regarding the algorithm. Many due to real life constraints not seen in academic research. Customer applications don’t live in a lab so the solutions that protect them can’t either.
Knowledge of the application security domain and a deep understanding of data are both – in our experience – crucial prerequisites for the design and implementation of any successful machine learning algorithm built to protect apps and the data that connects to them.
Learn more about protecting apps from attacks with Imperva SecureSphere or Imperva Incapsula Web Application Firewall.