Liam Stevenson

Machine Learning

Managed Detection & Response

April 29, 2020

7 mins read

Practical Machine Learning for Random (Filename) Detection

There are many ways to achieve robust detection of malicious or potentially malicious cyber activity in an organisation. Increasingly, though, security technology vendors and suppliers present “Machine Learning” and “Artificial Intelligence” as the solution. The pitch is often accompanied by vigorous justification of its technical credibility and married with terms like “unsupervised” or “anomaly detection” to add a vague complexity to the product or service.

In reality though, behind the marketing, how practical is it to use Machine Learning to perform high-signal, low-noise detection?

The problem space

First, before implementing anything, we need to determine the problem we are trying to solve. Machine learning isn’t just something you turn on; it’s a set of tools that you can use to solve a specific problem.

Our first bit of advice is that it shouldn’t be used on every problem – you wouldn’t hammer in a nail with a spanner!

The toolbox and its appropriateness

We need to consider whether machine learning is a good solution to our problem; computationally, it is often more expensive than more traditional techniques. So we should implement simple solutions where possible and use complex methods only where the problem requires them.

An example problem

For this example, we’ll select the problem of random file names. In a few instances, random file names are used by malware or implants when creating files on a disk.

They employ this technique to avoid traditional Indicator of Compromise signatures. A traditional technique would be to look for new files with a particular pattern or set of characters, which isn’t a good solution in this case (as the file name can be anything!).

When you consider this problem it can be broken down into two parts:

What is random?
Does this file name I am looking at fit my understanding of random?

How random

If we were to flesh out the question “what is random?” to fit our file name problem we need to consider human language. File names often contain words written by developers or users so that they can be easily identified for use later on, so it’s fair to assume we are going to see a lot of human language (warning: regional considerations here).

We can’t, however, base our understanding of randomness on language alone, as sometimes acronyms are used. A good example is MSTSC.exe; here we’d be analysing the string “MSTSC” to see how random it is. There are no complete words in this set of characters, so it could be judged random, but this is the Remote Desktop Services client, a legitimate Windows executable.

So how can we account for this?

N-gram scoring (https://en.wikipedia.org/wiki/N-gram) can be used to estimate the probability of a sequence of characters, typically taken in pairs (bigrams) or sets of three (trigrams). If we split the word “learning” up into trigrams we get:

lea, ear, arn, rni, nin, ing

If we use this technique on a large set of words, we can start to calculate the probability of seeing each trigram: the less probable the trigrams in a word are, the more random the word inherently is. This also addresses the issue of acronyms, provided the word list we supply to the model includes common acronyms found in operating systems. Less random (i.e. well used) file names will then appear often, making their trigrams more common.
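The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used in this post: the function names, the add-one smoothing for unseen trigrams and the averaging of negative log-probabilities are all assumptions, so the scores it produces are not comparable with the figures reported later.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the overlapping three-character windows of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_model(words):
    """Count every trigram seen in the training word list."""
    counts = Counter()
    for word in words:
        counts.update(trigrams(word))
    return counts


def randomness_score(name, counts, smoothing=1.0):
    """Average negative log-probability of the name's trigrams.

    Higher means the trigrams were rarer in the training words.
    The smoothing term is an assumption: it gives trigrams that were
    never seen a small, non-zero probability.
    """
    grams = trigrams(name)
    if not grams:
        return 0.0  # names shorter than three characters cannot be scored
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen trigrams
    score = 0.0
    for gram in grams:
        probability = (counts[gram] + smoothing) / (total + smoothing * vocab)
        score -= math.log(probability)
    return score / len(grams)
```

Because the comparison is case sensitive, a word list containing “mstsc” would not make “MSTSC” look familiar; the training words need to include the casing that appears on disk.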

Data Engineering

Before we can model our data using n-gram scoring, we have to prepare and clean the data so that it is uniform. Any data that isn’t uniform or standardised has the risk of skewing our results and therefore invalidating our model.

In a technology such as Splunk, we use Data Models (https://docs.splunk.com/Documentation/Splunk/8.0.3/Knowledge/Aboutdatamodels) as a mechanism to do this at scale. Using data models we can normalise data across data sources and query an abstract version of the data. For example, this allows us to normalise Microsoft Windows logs, Endpoint Detection and Response (EDR) product logs, Linux system logs, Antivirus logs… and any other source which contains a file name. In addition, this filters out malformed data before it is compared with our model, preventing erroneous results.
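Outside of Splunk, the same normalisation step could look something like the Python sketch below. It is only an illustration of the idea: the source names, the raw field names in `PATH_FIELDS` and the output keys are hypothetical and do not correspond to any particular data model.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical mapping: which raw field holds the file path for each source.
PATH_FIELDS = {
    "windows_event": "TargetFilename",
    "edr": "file_path",
    "linux_audit": "name",
}


def normalise(event):
    """Map a raw event onto one common shape, or return None if malformed."""
    field = PATH_FIELDS.get(event.get("source"))
    raw_path = event.get(field) if field else None
    if not isinstance(raw_path, str) or not raw_path.strip():
        return None  # drop malformed events rather than score them
    raw_path = raw_path.strip()
    # Assumption: a backslash anywhere in the path means a Windows path.
    path_type = PureWindowsPath if "\\" in raw_path else PurePosixPath
    path = path_type(raw_path)
    if not path.name:
        return None  # a bare drive or directory root has no file name
    return {
        "host": event.get("host", "unknown"),
        "file_path": raw_path,
        "file_name": path.stem,
        "extension": path.suffix.lower(),
    }
```

Returning `None` for anything that cannot be parsed keeps malformed records away from the model, which is the same protection the data model provides.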

Running the analytics

Models are computationally expensive to run and we don’t want to waste CPU time that could be better spent elsewhere.

One solution that we use to mitigate this problem is to split the analytic into two sections:

Create a scheduled task to build the model:
- In our example, we collected 10 open source books, a Microsoft Windows manual and file names from a fresh Microsoft Windows 10 install and then broke this down into a word list
- We then created a custom algorithm in Python and imported it into our Splunk environment
  - We used a case-sensitive algorithm, as the “random” file names we are looking for often use both upper and lower case
- Our scheduled task takes the list of file names and computes the model using the custom command to calculate an understanding of randomness
- The word list can then be regularly updated (to continue its training), which will update the understanding of randomness
Create a scheduled search to compare recently observed file names:
- This is inexpensive and should run often on a recent snapshot of data (for example, every 15-20 minutes with the aspiration of real-time) to lower the “time to detection”. This way you can inspect file names very often to identify one that is random
- This search should alert if the word is considered “random”
  - This is measured by a score given to the filename which should be adjusted to suit the environment in which it is deployed
- It is also important to consider what other information to display with the scoring to allow analysts to effectively triage the alert, for example:
  - The asset the filename was detected on
  - Any user associated with the file (if available)
  - The creating process, thread and the library from which the code originated
  - The file path the file was found in
  - Its size
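The split between an expensive model build and a cheap, frequent comparison can be sketched as follows. This is a simplified stand-in written in plain Python rather than as a Splunk custom command; the function names, the JSON model file, the smoothed scoring and the event keys are assumptions made for illustration.

```python
import json
import math
from collections import Counter


def trigrams(text):
    """Return the overlapping three-character windows of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_model(word_list, model_path):
    """Scheduled, expensive step: count trigrams and persist them to disk."""
    counts = Counter()
    for word in word_list:
        counts.update(trigrams(word))
    with open(model_path, "w", encoding="utf-8") as handle:
        json.dump(counts, handle)


def score(name, counts):
    """Mean negative log-probability of the name's trigrams (add-one smoothed)."""
    grams = trigrams(name)
    if not grams:
        return 0.0
    total = sum(counts.values()) + len(counts) + 1
    return sum(-math.log((counts.get(g, 0) + 1) / total) for g in grams) / len(grams)


def find_alerts(events, model_path, threshold):
    """Frequent, cheap step: load the stored model and score recent events."""
    with open(model_path, encoding="utf-8") as handle:
        counts = json.load(handle)
    alerts = []
    for event in events:
        value = score(event["file_name"], counts)
        if value > threshold:
            # Keep the whole event so the analyst sees the triage context
            # (asset, user, creating process, path, size) next to the score.
            alerts.append({**event, "score": value})
    return alerts
```

Only `build_model` touches the full word list, so it can run on a slow schedule, while `find_alerts` reads the stored counts and can run every few minutes. The threshold is left as a parameter because it has to be tuned per environment.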

The result

Using the trained model, we applied our understanding of randomness to 246 unique file names from a lab environment containing data from a Microsoft Windows 10 host (initially 1,814,396 events, reduced to 246 unique file names via aggregation), in order to understand what alert thresholds were appropriate.

These events were selected by including file paths that are commonly used by malware and implants which are understood to use random file names. This was done to prevent unnecessary load during the search and to reduce the false positive rate, with the understanding that it introduces fragility into the model.

In this example we included the following file paths (note, an asterisk represents a wildcard):

C:\Users\*\Downloads
C:\windows\*
C:\Users\*\AppData
C:\Users\*\AppData\Local\Temp
C:\Users\*\AppData\Roaming

In addition, we initially limited the file names to the following extensions only:

.dll
.vbs
.bat
.exe
.tmp
.dat
.txt
.zip
.ps1
.lnk
.csv
.json
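A scoping filter along these lines could be sketched in Python as shown here. It is an approximation of the search filter, not the query we ran: the trailing `\*` on each directory pattern, the case-insensitive matching and the reading of the shortcut extension as `.lnk` are assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PureWindowsPath

# Directory patterns from the list above. The trailing "\*" (anything
# beneath the directory) is an assumption about how the wildcard applied.
PATH_PATTERNS = [
    r"C:\Users\*\Downloads\*",
    r"C:\Windows\*",
    r"C:\Users\*\AppData\*",
    r"C:\Users\*\AppData\Local\Temp\*",
    r"C:\Users\*\AppData\Roaming\*",
]

# .lnk (Windows shortcut) is assumed for the shortcut extension.
EXTENSIONS = {".dll", ".vbs", ".bat", ".exe", ".tmp", ".dat",
              ".txt", ".zip", ".ps1", ".lnk", ".csv", ".json"}


def in_scope(file_path):
    """True if the path is in a monitored directory with a monitored extension."""
    lowered = file_path.lower()  # Windows paths are case insensitive
    if PureWindowsPath(lowered).suffix not in EXTENSIONS:
        return False
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in PATH_PATTERNS)


def unique_file_names(file_paths):
    """Aggregate in-scope events down to unique file names without extension."""
    return sorted({PureWindowsPath(p).stem for p in file_paths if in_scope(p)})
```

Note that `*` in `fnmatchcase` also matches path separators, so the two more specific `AppData` patterns are already covered by the broader one; they are kept only to mirror the list.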

In addition, four file names were injected into the data, based on file names that we had previously observed being used by malware. This was to ensure we had a good source of “random” file names from which we could test our alerting threshold:

QNxPOoodgpWs.exe
bf27b28.ps1
twKAO9jbPUxOJGWKw.exe
awMiOFl.zip

Running the analytic under these circumstances and taking the 20 highest scored file names (higher means more randomness) gave the following results:

Full File Path	File Name	Probability
C:\Users\Testuser\Downloads\twKAO9jbPUxOJGWKw.exe	twKAO9jbPUxOJGWKw	14.02564252250293
C:\Windows\System32\HOSTNAME.EXE	HOSTNAME	11.641312817737646
C:\Windows\System32\wbem\WMIADAP.exe	WMIADAP	11.473349274948804
C:\Windows\System32\NETSTAT.EXE	NETSTAT	11.325378012822753
C:\Users\Testuser\Downloads\QNxPOoodgpWs.exe	QNxPOoodgpWs	11.080201433044984
C:\Users\Testuser\Downloads\awMiOFl.zip	awMiOFl	10.826588199731106
C:\Windows\SoftwareDistribution\Download\Install\AM_Delta_Patch_1.273.337.0.exe	AM_Delta_Patch_1.273.337.0	10.77450761706021
C:\Windows\System32\SppExtComObj.Exe	SppExtComObj	10.19163059525599
C:\Windows\System32\VSSVC.exe	VSSVC	9.958672238091363
C:\Windows\System32\wbem\WmiPrvSE.exe	WmiPrvSE	9.746724892660179
C:\Windows\Installer\MSI981.tmp	MSI981	9.627533831701411
C:\Windows\System32\Windows.WARP.JITService.exe	Windows.WARP.JITService	9.57404664632686
C:\Windows\System32\wbem\WmiApSrv.exe	WmiApSrv	9.530512653876256
C:\Windows\System32\WUDFHost.exe	WUDFHost	9.454666477549624
C:\Users\Testuser\AppData\Local\Dropbox\Update\Install\{20BE0C72-7D8B-7E45-8BD6-39C44C35964F}\DropboxClient_53.4.67.exe	DropboxClient_53.4.67	9.247451275341293
C:\Windows\SoftwareDistribution\Download\Install\MpSigStub.exe	MpSigStub	9.226773828318503
C:\Windows\System32\MpSigStub.exe	MpSigStub	9.226773828318503
C:\Users\Testuser\Downloads\7z1900-x64.exe	7z1900-x64	9.21284553126196
C:\Windows\System32\PING.EXE	PING	9.178679419959044
C:\Users\Testuser\Downloads\bf27b28.ps1	bf27b28	9.10041590894542

Initial top 20 results

We can see that all four of our random file names made it into the top 20, with scores over 9. If we alerted on every score above 9, however, we’d get a very high false positive rate.

While experimenting with reducing the false positive rate of this analytic, we found that the following steps provided a better outcome:

Remove the event with the term “Dropbox”
- This is an expected term, and should be added to the original word list later to train the model more effectively next time
Run the same algorithm over the remaining file names, but convert them to lower case first
- This gave us an additional score for each file name
Add the two scores together to produce a combined score
Show the results where the original probability score is greater than 9, the lower case probability score is greater than 5 and the combined score is greater than 15
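The scoring and filtering steps above can be expressed as a small Python sketch. The function names are hypothetical, and `scorer` stands in for whatever randomness function is in use; only the three thresholds (9, 5 and 15) come from the steps themselves.

```python
def combined_scores(name, scorer):
    """Score a file name as-is and in lower case, then add the two scores."""
    original = scorer(name)
    lower_case = scorer(name.lower())
    return original, lower_case, original + lower_case


def passes_thresholds(original, lower_case,
                      min_original=9.0, min_lower=5.0, min_combined=15.0):
    """Apply the three filters: original > 9, lower case > 5, combined > 15."""
    combined = original + lower_case
    return (original > min_original
            and lower_case > min_lower
            and combined > min_combined)


# WMIADAP, using the two scores reported for it in this post: kept.
print(passes_thresholds(11.473349274948804, 5.038843682043336))

# A made-up file name whose lower case score falls under 5: dropped.
print(passes_thresholds(11.6, 4.2))
```

Keeping the thresholds as keyword arguments reflects the point that they are tuning knobs: a different environment will almost certainly need different values.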

By applying these steps we now get the following 14 results, a reduction of 6:

File Name	Probability	Lower Case Probability	Combined Probability
twKAO9jbPUxOJGWKw	14.02564252250293	10.23963040514879	24.26527292765172
WMIADAP	11.473349274948804	5.038843682043336	16.512192956992140
QNxPOoodgpWs	11.080201433044984	7.19174236184474	18.27194379488972
awMiOFl	10.826588199731106	5.043856933376094	15.870445133107200
AM_Delta_Patch_1.273.337.0	10.77450761706021	9.225814207938893	20.00032182499911
SppExtComObj	10.19163059525599	5.814316303333223	16.00594689858922
VSSVC	9.958672238091363	5.500464534239088	15.459136772330451
WmiPrvSE	9.746724892660179	5.529086860966785	15.275811753626964
MSI981	9.627533831701411	7.590279947779553	17.217813779480963
Windows.WARP.JITService	9.57404664632686	6.911562918090688	16.48560956441755
WmiApSrv	9.530512653876256	5.700127932528084	15.230640586404341
WUDFHost	9.454666477549624	6.073900352974089	15.528566830523713
7z1900-x64	9.21284553126196	9.21284553126196	18.42569106252392
bf27b28	9.10041590894542	9.10041590894542	18.20083181789084

While we have captured all the random events, the false positive rate remains high. We could alert only where the probability score is over 10 to reduce the false positives, but a security operation must consider whether that reduction is worth the false negatives it introduces.
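To make that trade-off concrete, the short Python sketch below counts outcomes at two thresholds using the filtered results table above, with scores rounded to two decimal places. Treating only the four injected file names as true positives is an assumption of the experiment, and the helper name is hypothetical.

```python
# (file name, original score rounded to 2 d.p., injected as "random"?)
RESULTS = [
    ("twKAO9jbPUxOJGWKw", 14.03, True),
    ("WMIADAP", 11.47, False),
    ("QNxPOoodgpWs", 11.08, True),
    ("awMiOFl", 10.83, True),
    ("AM_Delta_Patch_1.273.337.0", 10.77, False),
    ("SppExtComObj", 10.19, False),
    ("VSSVC", 9.96, False),
    ("WmiPrvSE", 9.75, False),
    ("MSI981", 9.63, False),
    ("Windows.WARP.JITService", 9.57, False),
    ("WmiApSrv", 9.53, False),
    ("WUDFHost", 9.45, False),
    ("7z1900-x64", 9.21, False),
    ("bf27b28", 9.10, True),
]


def outcomes(results, threshold):
    """Count alert outcomes when alerting on scores above the threshold."""
    tp = sum(1 for _, s, is_random in results if s > threshold and is_random)
    fp = sum(1 for _, s, is_random in results if s > threshold and not is_random)
    fn = sum(1 for _, s, is_random in results if s <= threshold and is_random)
    return {"true_positives": tp, "false_positives": fp, "false_negatives": fn}


for threshold in (9, 10):
    print(threshold, outcomes(RESULTS, threshold))
```

On this data, raising the threshold from 9 to 10 cuts the false positives from 10 to 3, but bf27b28.ps1 (score 9.10) is no longer alerted on, so one of the four injected file names becomes a false negative.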

This tuning, while providing some benefit here, is likely to be specific to this environment. The analytic does expose some variables that analysts can change to tune it on a case-by-case basis.

In addition, it’s easy to see why the rate of false positives must be measured and the analytic regularly refined for accuracy. In this example, it is probable that expanding the initial word list and tuning out known good Microsoft Windows file names would result in a significant reduction in false positives.

Conclusion

In this post we’ve explored the high-level steps of making a real-world analytic for a Security Operations function that relies on “machine learning”. Specifically, we’ve highlighted that it is a problem-driven solution that requires continual effort to maintain, not a blanket way of solving the detection problem space in its entirety.

Published by Liam Stevenson
