Protecting cloud workloads from zero-day vulnerabilities like Log4Shell is a challenge that every organization faces.
When a vulnerability is published, organizations can try to identify impacted artifacts through software composition analysis, but even if they’re able to identify all impacted areas, the patching process can be cumbersome and time-consuming. As we saw with Log4Shell, this can become even more complicated when the vulnerability is nearly ubiquitous.
But patching doesn’t address the risk organizations face in the window between a zero-day’s discovery and its publication. According to MIT Technology Review, at least 66 zero-day vulnerabilities were discovered in 2021, the most on record in a single year.
Preventing zero-day exploitation from impacting an organization requires a proactive approach that automatically detects and protects customers against post-exploitation activity.
CrowdStrike Falcon Cloud Workload Protection has added context-driven anomaly detections to fill these gaps by providing robust zero-day protection for applications that run in the cloud.
Why Use Anomaly Detection?
We understand our various adversaries and their core objectives well, which usually allows us to find them regardless of how they get into a system. For example, a cryptojacker will typically download and start up a coin miner. A ransomware-focused group will usually end up trying to encrypt a lot of files. And all groups, regardless of their objective, will probably conduct some reconnaissance.
As a result of moving our workloads to the cloud, though, our attack surface has become larger, more complex and faster-changing than ever. At the same time, the army of strongly motivated threat actors trying to find stealthy new ways to exploit this hard-to-defend surface grows every day. There will always be cases of clever actors behaving in novel ways and evading defenses set to look for specific, expected patterns. Log4Shell, for example, makes it easy for attackers to inject behavior into trusted applications, leveraging the power of the Java virtual machine (JVM) to achieve their objectives in new ways.
However, the shift to cloud technologies also gives us a new advantage. Cloud-based workloads tend to be small, single-purpose and immutable, and we have visibility into earlier pre-deployment stages of the development cycle that give us useful information about their components and intended configuration. Unlike the behavior of a general-purpose computer, the behavior of a workload is predictable within certain contexts. This means that, in addition to trying to predict which specific abnormal behaviors attackers might introduce to the system, we can define “normal behavior” and flag any significant deviation from it, including novel attacks on undisclosed vulnerabilities.
Profiling and anomaly detection are complicated techniques with checkered pasts. They are certainly not sufficient as a detection strategy on their own, and great care has to be taken to avoid a flood of false positives. We need to introduce them thoughtfully and correctly in a way that complements our existing attacker-focused detections and provides another layer of protection against zero-day and known threats without negating the benefits of tried and true approaches.
New Context Enables Anomaly Detections
To enable the first iteration of anomaly detections, we added context to the Falcon sensor, allowing us to segment its telemetry in ways that make it more predictable. If we know that certain events come from an Apache Tomcat instance hosting a particular service in a particular Kubernetes cluster, for example, we can make better judgments about whether they describe typical behavior than we can about events from an arbitrary Linux server we know nothing about.
The new context includes information attached to process trees:
- The specific long-running application that spawned the tree (e.g., “Weblogic”)
- Whether the tree is the result of a Docker exec command
- Whether a tree appears to contain signs of hands-on-keyboard activity
It becomes even more powerful when combined with data from external sources, like user policies, and when grouped via machine learning.
Initial Anomaly Detections Using New Context
By analyzing data from millions of sensors, we can identify invariants and encode them as high-confidence detections. We know, for example, that Redis does not typically spawn an interactive shell anywhere in its process tree. It also does not modify crontab. Both actions are extremely security-relevant, so it makes sense to use these insights to alert when Redis does either.
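The Redis invariants above could be encoded roughly as follows. This is a minimal sketch, not the Falcon sensor's actual detection logic: the `ProcessNode` type, its fields and the `redis_invariant_violations` helper are hypothetical names introduced for illustration, and treating any `crontab` invocation other than a listing (`-l`) as a modification is a simplifying assumption.

```python
from dataclasses import dataclass, field

# Hypothetical process-tree node. Field names are illustrative only and do
# not reflect the Falcon sensor's real telemetry schema.
@dataclass
class ProcessNode:
    name: str
    args: list = field(default_factory=list)
    interactive: bool = False  # assumed flag: process is attached to a terminal
    children: list = field(default_factory=list)

SHELLS = {"sh", "bash", "dash", "zsh"}

def walk(node):
    """Yield every descendant of node, excluding node itself."""
    for child in node.children:
        yield child
        yield from walk(child)

def redis_invariant_violations(root):
    """Return the invariants broken by a process tree rooted at Redis."""
    if root.name != "redis-server":
        return []  # these invariants only apply to trees spawned by Redis
    violations = []
    for proc in walk(root):
        if proc.name in SHELLS and proc.interactive:
            finding = "interactive_shell"
        elif proc.name == "crontab" and "-l" not in proc.args:
            # Assumption: any crontab call other than a listing is a change.
            finding = "crontab_modified"
        else:
            continue
        if finding not in violations:
            violations.append(finding)
    return violations
```

Because the rule is scoped to a single application, it can stay strict: the same check applied to an arbitrary Linux host, where interactive shells are routine, would be far noisier.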
Newly added telemetry makes it easier to capture what is and isn’t normal for a particular containerized workload. It includes events such as:
- Direct connection to kubelet made (potential lateral movement attempt)
- External process memory accessed (potential credential theft)
- New executables added to a running container
- Interactive sessions started in a container
- Horizontal port scan
- Normal port scan
- And more
Without context, this telemetry doesn’t get us far. There are plenty of workloads that connect directly to kubelet — monitoring software, for example. Also, a surprising number of legitimate applications conduct port scanning or port scan-like activity.
Adding context boosts the signal: It’s very unusual to exec into a production container and start a port scan, or exec into a production container and start making mutating calls to kubelet endpoints. We also happen to know that many cloud-aware threat actors do both things.
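One way to picture how context changes the weight of the same raw event is the sketch below. It is illustrative only: the `Event` fields, the `KNOWN_KUBELET_CLIENTS` allowlist and the three severity levels are assumptions made for this example, not Falcon's data model or scoring.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str         # e.g. "port_scan" or "kubelet_connection"
    via_exec: bool    # the process tree began with an exec into the container
    environment: str  # e.g. "production" or "staging"
    workload: str     # e.g. "monitoring-agent"

# Hypothetical allowlist of workloads expected to talk to kubelet directly.
KNOWN_KUBELET_CLIENTS = {"monitoring-agent"}

def score_event(event):
    """Rate an event as "none", "low" or "high" using its context."""
    if event.kind not in {"port_scan", "kubelet_connection"}:
        return "none"
    # The same telemetry is far more suspicious after an exec into production.
    if event.via_exec and event.environment == "production":
        return "high"
    # Monitoring software legitimately connects to kubelet.
    if event.kind == "kubelet_connection" and event.workload in KNOWN_KUBELET_CLIENTS:
        return "none"
    return "low"
```

The event kind alone never produces a "high" rating here; only the combination of the event with the exec and production context does.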
These initial detections ride the line between anomaly detections and traditional attacker-technique-focused detections. We’re diving into the data to find out what’s abnormal in certain situations, but our work is still driven by hunches about obvious universal deviations from standard behavior.
Gain Understanding with More Sophisticated Anomaly Detections
Several key pieces will make more sophisticated anomaly detections possible, meaning detections capable of automatically identifying subtle changes in specific services within specific environments. These pieces include:
- More powerful user policies
- An understanding of higher-level cloud workload constructs like “service”
- Telemetry aggregation across these constructs and over time
- More information from build and deploy phases
As these pieces fall into place, and we gain a better understanding of the workloads we protect via context supplied by the user and extracted from sources like Kubernetes, we can begin to make more confident predictions on the fly. For instance, pods that are part of the “payroll” service do not make any connections to the external internet, and clusters designated “production” do not normally have users exec’ing into containers and installing packages. For example, in Falcon sensor 6.35, we can examine basic process profiling of individual containers and enhanced drift detection, as seen in Figure 1.
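Expectations like the “payroll” and “production” examples could be written down as data and checked against incoming events. The sketch below assumes a hypothetical event shape (plain dictionaries with `type`, `dest_ip` and `via_exec` keys) and hypothetical label names (`service`, `cluster`). It also treats any address outside private, loopback and link-local ranges as external, which is a simplification.

```python
import ipaddress

def is_external(address):
    """Assumption: anything not private, loopback or link-local is external."""
    ip = ipaddress.ip_address(address)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

# Each expectation: (label key, label value, violation predicate, finding).
EXPECTATIONS = [
    ("service", "payroll",
     lambda e: e["type"] == "connect" and is_external(e["dest_ip"]),
     "payroll pod connected to the external internet"),
    ("cluster", "production",
     lambda e: e["type"] == "package_install" and e.get("via_exec", False),
     "package installed from an exec session in production"),
]

def check_event(labels, event):
    """Return findings for expectations that match the labels and are violated."""
    findings = []
    for key, value, violates, message in EXPECTATIONS:
        if labels.get(key) == value and violates(event):
            findings.append(message)
    return findings

labels = {"service": "payroll", "cluster": "production"}
print(check_event(labels, {"type": "connect", "dest_ip": "8.8.8.8"}))
```

Keeping the expectations in a table rather than in code paths means new ones, whether supplied by a user policy or derived from Kubernetes metadata, can be added without touching the checking logic.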
With rich enough telemetry and context, we can also begin to bring machine learning to bear on the problem, although that’s a topic for a different post.
Block Zero-days Before They’re Exploited
Reliable detection of deviations from expected behavior can be very powerful for stopping breaches. A Java workload running in a container with a vulnerable version of Log4j will begin behaving differently once compromised. It might make new outgoing network connections, add or run new executable files, or edit critical system files it doesn’t normally touch. All of these things would have set off alerts even before Log4Shell was in the headlines.
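The three deviations described for a compromised Java workload map naturally onto a comparison between a learned baseline and current observations. The following is a minimal sketch of that idea under stated assumptions: `BehaviorProfile`, its three fields and the `drift` function are hypothetical, and a real profile would need to handle noise, time windows and legitimate change.

```python
from dataclasses import dataclass, field, fields

# Hypothetical per-container profile covering the three behaviors named above.
@dataclass
class BehaviorProfile:
    outbound_destinations: set = field(default_factory=set)
    executables_run: set = field(default_factory=set)
    files_written: set = field(default_factory=set)

def drift(baseline, observed):
    """Return, per category, the behaviors seen now but absent from the baseline."""
    result = {}
    for f in fields(BehaviorProfile):
        new_items = getattr(observed, f.name) - getattr(baseline, f.name)
        if new_items:
            result[f.name] = new_items
    return result

baseline = BehaviorProfile(
    outbound_destinations={"db.internal:5432"},
    executables_run={"/usr/bin/java"},
)
observed = BehaviorProfile(
    outbound_destinations={"db.internal:5432", "203.0.113.9:1389"},
    executables_run={"/usr/bin/java", "/tmp/miner"},
    files_written={"/etc/crontab"},
)
print(drift(baseline, observed))
```

Nothing in the comparison refers to Log4Shell or any other specific vulnerability, which is why this style of check can flag exploitation of a flaw that has not yet been disclosed.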
Within cloud workload protection, we remain attacker-focused and will always research and detect the specific tactics of significant threat actors. However, by taking advantage of the opportunities the new, cloud-centric technology stack provides, we’re beginning to build another layer of protection through anomaly detection. Both layers will block the next zero-day before anyone even knows it exists.