There are many ways to achieve robust detection of malicious or potentially malicious cyber activity in an organisation. Increasingly, though, security technology vendors and suppliers present “Machine Learning” and “Artificial Intelligence” as the solution. The pitch is often accompanied by a vigorous justification of its technical credibility and married with terms like “unsupervised” or “anomaly detection”, which add a vague air of complexity to the product or service.
Behind the marketing, though, how practical is it to use Machine Learning to perform high-signal, low-noise detection?
The problem space
First, before implementing anything, we need to define the problem we are trying to solve. Machine learning isn’t something you simply turn on; it’s a set of tools that you can use to solve a specific problem.
Our first bit of advice is that it shouldn’t be used on every problem – you wouldn’t hammer in a nail with a spanner!
The toolbox and its appropriateness
We need to consider whether machine learning is a good solution to our problem; computationally, it is often more expensive than more traditional techniques. We should therefore implement simple solutions where possible and use complex methods only where the problem requires them.
An example problem
For this example, we’ll select the problem of random file names. In a few instances, random file names are used by malware or implants when creating files on a disk.
They employ this technique to avoid traditional Indicator of Compromise signatures. A traditional technique would be to look for new files with a particular pattern or set of characters, which isn’t a good solution in this case (as the file name can be anything!).
When you consider this problem it can be broken down into two parts:
- What is random?
- Does this file name I am looking at fit my understanding of random?
If we were to flesh out the question “what is random?” to fit our file name problem we need to consider human language. File names often contain words written by developers or users so that they can be easily identified for use later on, so it’s fair to assume we are going to see a lot of human language (warning: regional considerations here).
We can’t, however, base our understanding of randomness on language alone, as sometimes acronyms are used. A good example is MSTSC.exe; here we’d be analysing the string “MSTSC” to see how random it is. There are no complete words in this set of characters, so it could be judged random, but this is the Remote Desktop Connection client, a legitimate Windows executable.
So how can we account for this?
N-gram scoring (https://en.wikipedia.org/wiki/N-gram) can be used to estimate the probability of a sequence of characters, typically taken in pairs (bigrams) or sets of three (trigrams). If we split the word “learning” up into trigrams we get: “lea”, “ear”, “arn”, “rni”, “nin” and “ing”.
If we apply this technique to a large set of words, we can calculate the probability of seeing each trigram; the less probable the trigrams in a word, the more random the word is. This also addresses the acronym problem: if the word list we supply to the model includes common acronyms found in operating systems, well-used file names appear often and their trigrams become more common, so they score as less random.
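To make the idea concrete, here is a minimal sketch of trigram scoring in Python. It is not the custom algorithm used later in this post; the function names, the add-one smoothing and the averaged negative log-probability are our assumptions for illustration, so the scores it produces are on a different scale from those quoted below.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the overlapping three-character windows of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_model(words):
    """Count every trigram seen in the training word list (case sensitive)."""
    counts = Counter()
    for word in words:
        counts.update(trigrams(word))
    return counts


def randomness_score(name, counts):
    """Average negative log-probability of the trigrams in `name`.

    Add-one smoothing (an assumption of this sketch) gives unseen trigrams
    a small non-zero probability. Higher scores mean less familiar
    trigrams, i.e. a more random-looking name.
    """
    grams = trigrams(name)
    if not grams:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for "never seen"
    cost = 0.0
    for gram in grams:
        probability = (counts[gram] + 1) / (total + vocab)
        cost -= math.log(probability)
    return cost / len(grams)


# Tiny illustrative word list; a real model needs a far larger one.
model = build_model(["learning", "machine", "earning", "mstsc", "explorer"])
```

With this toy model, `randomness_score("learning", model)` is lower than `randomness_score("xqzvkwpj", model)`, because every trigram of the first word was seen in training and none of the second were.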
Before we can model our data using n-gram scoring, we have to prepare and clean the data so that it is uniform. Any data that isn’t uniform or standardised has the risk of skewing our results and therefore invalidating our model.
In a technology such as Splunk, we use Data Models (https://docs.splunk.com/Documentation/Splunk/8.0.3/Knowledge/Aboutdatamodels) as a mechanism to do this at scale. Using data models we can perform normalisation of data across data sources and query an abstract version of the data. For example, this allows us to normalise Microsoft Windows logs, Endpoint Detection & Response (EDR) product logs, Linux system logs, Antivirus logs… and any other source which contains a filename. In addition, this filters out any malformed data being compared with our model, preventing erroneous results.
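Outside of Splunk, the same normalisation idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Data Models work internally; the source names and field names in the mapping are hypothetical.

```python
import ntpath

# Hypothetical mapping from each log source to the field holding its path.
FIELD_BY_SOURCE = {
    "windows": "TargetFilename",
    "edr": "file_path",
    "linux": "path",
}


def normalise(record):
    """Return {'source', 'file_name'}, or None if the record is malformed."""
    field = FIELD_BY_SOURCE.get(record.get("source"))
    raw = record.get(field) if field else None
    if not isinstance(raw, str) or not raw.strip():
        return None
    # ntpath splits on both backslashes and forward slashes.
    file_name = ntpath.basename(raw.strip())
    if not file_name:
        return None
    return {"source": record["source"], "file_name": file_name}
```

Records from an unknown source, or with a missing or empty path, come back as `None` and can be dropped before they ever reach the model.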
Running the analytics
Models are computationally expensive to run and we don’t want to waste CPU time that could be better spent elsewhere.
One solution that we use to mitigate this problem is to split the analytic into two sections:
- Create a scheduled task to build the model:
  - In our example, we collected 10 open source books, a Microsoft Windows manual and the file names from a fresh Microsoft Windows 10 install, and then broke these down into a word list
  - We then created a custom algorithm in Python and imported it into our Splunk environment
  - We used a case sensitive algorithm, as the “random” file names we are looking for often use both upper and lower case
  - Our scheduled task takes the word list and computes the model, using the custom command to calculate an understanding of randomness
  - The word list can then be regularly updated (to continue its training), which will update the understanding of randomness
- Create a scheduled search to compare recently observed file names:
  - This is inexpensive and should run often on a recent snapshot of data (for example, every 15-20 minutes, with the aspiration of real-time) to lower the “time to detection”, so that a random file name is identified soon after it appears
  - This search should alert if the file name is considered “random”
  - This is measured by a score given to the file name; the alerting threshold should be adjusted to suit the environment in which it is deployed
  - It is also important to consider what other information to display with the score to allow analysts to triage the alert effectively, for example:
    - The asset the file name was detected on
    - Any user associated with the file (if available)
    - The creating process, thread and the library from which the code originated
    - The file path the file was found in
    - Its size
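The split described above can be sketched as two small stages in Python. This is an outline under our own assumptions rather than the Splunk implementation: the JSON file used to persist the model, the event field names and the `score_fn` parameter are all hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def build_and_save_model(words, model_path):
    """Stage 1 (infrequent, expensive): count trigrams and persist them."""
    counts = Counter()
    for word in words:
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    Path(model_path).write_text(json.dumps(counts))
    return counts


def load_model(model_path):
    """Stage 2 helper: reload the stored counts without recomputing them."""
    return Counter(json.loads(Path(model_path).read_text()))


def alerts_for(events, score_fn, threshold):
    """Stage 2 (frequent, cheap): score recent events, keep triage context."""
    # Hypothetical field names for the context an analyst needs.
    triage_fields = ("asset", "user", "creating_process", "file_path", "size")
    alerts = []
    for event in events:
        score = score_fn(event["file_name"])
        if score > threshold:
            alert = {"file_name": event["file_name"], "score": score}
            alert.update({f: event.get(f) for f in triage_fields})
            alerts.append(alert)
    return alerts
```

Keeping the expensive model build out of the frequent search means the comparison stage only needs to load the stored counts and score a small, recent window of events.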
Using the trained model, we applied our understanding of randomness to 246 unique file names from a lab environment containing data from a Microsoft Windows 10 host (initially 1,814,396 events, reduced to 246 unique file names via aggregation), in order to understand what alert thresholds were appropriate.
These events were selected by including only file paths commonly used by malware and implants that are understood to use random file names. This was done to avoid unnecessary load during the search and to reduce the false positive rate, with the understanding that it introduces fragility into the model.
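The selection and aggregation step could look something like the sketch below. The path patterns and extensions here are placeholders chosen purely for illustration; they are not the lists used in this experiment.

```python
from fnmatch import fnmatchcase
import ntpath

# Placeholder values for illustration only; not the lists used in this post.
EXAMPLE_PATH_PATTERNS = ["c:\\users\\*\\appdata\\*", "c:\\windows\\temp\\*"]
EXAMPLE_EXTENSIONS = {".exe", ".dll"}


def unique_candidate_names(file_paths):
    """Keep paths matching a pattern and extension, then de-duplicate names."""
    names = set()
    for path in file_paths:
        lowered = path.lower()  # match paths case-insensitively
        extension = ntpath.splitext(lowered)[1]
        if extension not in EXAMPLE_EXTENSIONS:
            continue
        if any(fnmatchcase(lowered, p) for p in EXAMPLE_PATH_PATTERNS):
            # Keep the original case: the scoring itself is case sensitive.
            names.add(ntpath.basename(path))
    return sorted(names)
```

Collapsing events to a set of unique names is what turns a very large number of raw events into a short list that is cheap to score.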
In this example we included the following file paths (note, an asterisk represents a wildcard):
In addition, we initially limited the file names to the following extensions only:
In addition, 4 file names were injected into the data, based on file names that we had previously observed being used by malware. This was to ensure we had a good source of “random” file names against which we could test our alerting threshold:
Running the analytic under these circumstances and taking the 20 highest scored file names (higher means more randomness) gave the following results:
| Full File Path | File Name | Probability |
We can see that all 4 of our random file names made it into the top 20, with scores over 9. If we alerted on everything scoring above 9, however, we’d get a very high false positive rate.
While experimenting with ways to reduce the false positive rate of this analytic, we found that the following steps provided a better outcome:
- Remove the event with the term “Dropbox”
  - This is an expected term, and should be added to the original word list later to train the model more effectively another time
- Run the same algorithm over the remaining 20 file names, but convert them to lower case first
  - This gave us an additional score for the file names
- Combine the two scores to produce a combined score
- Show the results where the original probability score was greater than 9, the lower case probability score was greater than 5 and the combined score is greater than 15
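The steps above can be sketched as a single filter. The thresholds of 9, 5 and 15 come from the steps themselves; treating the combined score as the sum of the two scores is our assumption, and `score_fn` is a hypothetical stand-in for whatever scoring function the model provides.

```python
def tuned_results(file_names, score_fn, expected_terms=("dropbox",),
                  min_original=9, min_lower=5, min_combined=15):
    """Apply the tuning steps and return (name, original, lower, combined)."""
    results = []
    for name in file_names:
        # Step 1: drop names containing a term we expect to see.
        if any(term in name.lower() for term in expected_terms):
            continue
        # Step 2: score the name as observed and again in lower case.
        original = score_fn(name)
        lower = score_fn(name.lower())
        # Step 3: combine the scores (assumed here to be a simple sum).
        combined = original + lower
        # Step 4: keep only names that clear all three thresholds.
        if (original > min_original and lower > min_lower
                and combined > min_combined):
            results.append((name, original, lower, combined))
    return sorted(results, key=lambda row: row[3], reverse=True)
```

Exposing the thresholds as parameters keeps the tuning visible, which matters because values that suit one environment may not suit another.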
By applying these steps we now get the following 16 results, a reduction of 4:
| File Name | Probability | Lower Case Probability | Combined Probability |
While we have captured all the random events, the false positive rate remains high. We could alert only where the probability score is over 10 to reduce the false positives, but a security operation must consider whether that reduction is worth the false negative trade-off.
This tuning, while providing some benefit here, is likely to be specific to this environment. The analytic does, however, expose some variables that analysts can change to tune it on a case-by-case basis.
In addition, it’s easy to see why the rate of false positives must be measured and the analytic regularly refined for accuracy. In this example, it is probable that increasing the initial word list and tuning out known good Microsoft Windows file names would result in a significant reduction in false positives.
In this post we’ve explored the high-level steps of making a real-world analytic for a Security Operations function that relies on “machine learning”, and specifically highlighted that it is a problem-driven solution that requires continual effort to maintain, not a blanket way of solving the detection problem space in its entirety.
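One way to reason about that trade-off is to count outcomes at each candidate threshold using a set of file names whose status is already known, such as the injected names. The helper below is a minimal sketch; the function name and the `(score, is_malicious)` input format are our assumptions.

```python
def confusion_at(threshold, scored):
    """Count outcomes for a list of (score, is_malicious) pairs.

    A name raises an alert when its score is strictly above the threshold.
    """
    scored = list(scored)  # allow any iterable; we pass over it three times
    true_pos = sum(1 for s, bad in scored if s > threshold and bad)
    false_pos = sum(1 for s, bad in scored if s > threshold and not bad)
    false_neg = sum(1 for s, bad in scored if s <= threshold and bad)
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }
```

Running this for a range of thresholds over labelled data shows how many known-bad names would be missed for each reduction in false positives.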