CAMLIS 2019: CONFERENCE ON APPLIED MACHINE LEARNING FOR INFORMATION SECURITY
PROGRAM FOR SATURDAY, OCTOBER 26TH

09:00-10:30 Session 5
09:00
Rachel Allen (NVIDIA, United States)
Accelerating The Alert Triage Scenario (AT-ATs): InfoSec Data Science with RAPIDS

ABSTRACT. To keep pace with cyber adversaries, organizations are constantly evolving in their approaches to information monitoring. With the addition of every new alert generated by ML models, heuristics, or sensors comes an additional data feed in need of triage and analysis. SOCs are frequently overwhelmed by the volume of alerts and unable to analyze a large portion of their data, resulting in potentially missed malicious activity. By leveraging the data processing and analytic capabilities of RAPIDS, a suite of open-source software libraries that allow for end-to-end data science pipelines in GPU memory, we demonstrate how it's possible to explore, analyze, and prioritize massive amounts of heterogeneous cyber data in real-time.

In this hands-on tutorial, we work through two approaches to data exploration and alert prioritization. First, we demonstrate data exploration of model and sensor outputs in RAPIDS, using common methods for feature engineering, data manipulation, and statistical and trend analysis. Second, we use graph embeddings and Personalized PageRank to prioritize alerts according to the criticality of the individuals and infrastructure involved. Attendees will be able to execute code and leave with the tools necessary to create custom workflows in their own security and research environments.
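
The Personalized PageRank step can be illustrated with a small, self-contained sketch. The tutorial itself uses RAPIDS on GPU; the plain-Python power iteration below is a stand-in, with a hypothetical toy graph linking alerts to the hosts they involve and a restart vector weighted toward a critical host.

```python
def personalized_pagerank(edges, nodes, personalization, alpha=0.85, iters=100):
    """Personalized PageRank by power iteration over an edge list.

    `personalization` is the restart distribution, weighted toward critical
    users and infrastructure; dangling-node mass is redistributed along the
    same vector so total rank is conserved.
    """
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    total = sum(personalization.values())
    p = {u: personalization.get(u, 0.0) / total for u in nodes}
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        incoming = {u: 0.0 for u in nodes}
        dangling = 0.0
        for u in nodes:
            if out[u]:
                share = rank[u] / len(out[u])
                for v in out[u]:
                    incoming[v] += share
            else:
                dangling += rank[u]
        rank = {u: alpha * (incoming[u] + dangling * p[u]) + (1 - alpha) * p[u]
                for u in nodes}
    return rank

# Hypothetical example: two alerts touch two hosts; "dc01" is marked critical.
nodes = ["alert1", "alert2", "dc01", "ws07"]
edges = [("alert1", "dc01"), ("alert2", "dc01"), ("alert2", "ws07")]
scores = personalized_pagerank(edges, nodes, {"dc01": 1.0})
```

Alerts and infrastructure touching the personalized (critical) nodes accumulate more rank, giving a principled priority ordering for triage.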

10:00
Konstantin Berlin (Sophos Ltd., United States)
Scalable Infrastructure for Malware Labeling and Analysis

ABSTRACT. One of the best-known secrets of machine learning (ML) is that the most reliable way to get more accurate models is by simply getting more training data and more accurate labels. This observation is also true for malware detection models like the ones we deploy at Sophos. Unfortunately, generating larger, more accurate datasets is arguably a much bigger challenge in the security domain than in most other domains, and poses unique challenges. Malware labeling information is usually not available at time of observation, but comes from various internal and external intelligence feeds, months or even years after a given sample is first observed. Furthermore, labels from these feeds can be inaccurate, incomplete, and even worse, change over time, necessitating joining multiple feeds and frequently adjusting the labeling methodology over entire datasets. Finally, realistic evaluations of an antimalware ML model often require being able to “roll back” to a previous label set, requiring a historical record of the labels at the time of training and evaluation.

All this, under the constraint of a limited budget.

In this presentation, we will show how to use AWS infrastructure to solve the above problems in a fast, efficient, scalable, and affordable manner. The infrastructure we describe can support the data collection and analysis needs of a global Data Science team, all while maintaining GDPR compliance and being able to efficiently export data to edge services that power a global reputation lookup.

We start by describing why developing the above infrastructure at reasonable cost is surprisingly difficult. We focus specifically on the different requirements, such as the need to correlate information from internal and external sources across large time ranges, as well as the ability to roll back knowledge to particular timeframes in order to properly develop and validate detection models.

Next, we describe how to build such a system in a way that is scalable, agile, and affordable, using AWS cloud infrastructure. We start out describing how to effectively aggregate and batch data across regions into a regional data lake in a way that is GDPR compliant, utilizing SQS, auto-scaling spot instance clusters, and S3 replication. We then describe how we store, join, and analyze this data at scale by batch inserting the replicated data into a distributed columnar Redshift database.

We then describe how we organize the data effectively in Redshift tables, taking proper care to define distribution and sort keys, deal with duplicate entries (as key uniqueness is not enforced), and perform SQL joins efficiently on a daily cadence. Finally, we show how we can export the data at scale to local storage for ML training, or propagate this data to edge services, like replicated DynamoDB, to power a global reputation service.
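
The label "roll back" requirement can be sketched compactly: keep feed events as an append-only history and reconstruct the label set as of a chosen date, instead of overwriting labels in place. This is an illustrative stand-in for the Redshift-based pipeline described in the talk, with hypothetical feed events.

```python
from datetime import date

def labels_as_of(events, as_of):
    """Reconstruct the label snapshot at a point in time.

    events: iterable of (sha256, label, observed_date) tuples from joined
    internal and external feeds. Later events within the window win, so a
    label that changed after `as_of` is rolled back to its earlier value.
    """
    snapshot = {}
    for sha256, label, observed in sorted(events, key=lambda e: e[2]):
        if observed <= as_of:
            snapshot[sha256] = label
    return snapshot

# Hypothetical feed history: one sample's label flips months after first sight.
events = [
    ("abc", "benign", date(2018, 6, 1)),
    ("abc", "malicious", date(2019, 2, 1)),
    ("def", "malicious", date(2018, 7, 15)),
]
training_labels = labels_as_of(events, date(2018, 12, 31))
```

Evaluating a model against `labels_as_of(events, training_date)` rather than today's labels avoids leaking future intelligence into historical evaluations.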

13:00-14:30 Session 6
13:00
Philip Tully (FireEye, United States)
Matthew Haigh (FireEye, United States)
Jay Gibble (FireEye, United States)
Michael Sikorski (FireEye, United States)
Learning to Rank Relevant Malware Strings Using Weak Supervision

ABSTRACT. In static analysis, one of the most useful initial steps is to inspect a binary's printable characters via the Strings program. However, filtering out relevant strings by hand is time consuming and prone to human error. Relevant strings occur disproportionately less often than irrelevant ones, larger binaries can produce many thousands of strings that quickly evoke analyst fatigue, and the definition of "relevant" can vary significantly across individual analysts. Mistakes can lead to missed clues that would have reduced overall time spent performing malware analysis, or even worse, incomplete or incorrect investigatory conclusions.

To address these concerns, we present StringSifter: an open source machine learning-based tool that automatically ranks strings. StringSifter is built to sit downstream from the Strings program; it takes a list of strings as input and returns those same strings ranked according to their relevance for malware analysis as output. StringSifter makes an analyst's life easier, allowing them to focus their attention on only the most relevant strings located towards the top of its predicted output.

StringSifter is trained on a sample of the 3.1 billion individual strings extracted from Strings program outputs of the 400k malicious PE files in the EMBER dataset. Strings are labeled with ordinal ranks obtained from Snorkel, a weak supervision procedure that trains a generative model over SME-derived signatures, i.e. labeling functions, to resolve their underlying conflicts. For each string, Snorkel produces a probabilistic label, which represents a lineage that takes into account the correlation structure of its source labeling functions. This data programming approach allows us to cheaply and rapidly annotate our large corpus of strings and incorporate the subject matter expertise of reverse engineers directly into our model.
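
The labeling-function idea can be sketched without the Snorkel library. The hypothetical functions below encode SME heuristics and either vote or abstain; a naive majority vote stands in here for Snorkel's generative model, which resolves conflicts probabilistically using the labeling functions' correlation structure.

```python
# Ordinal relevance grades; ABSTAIN means a labeling function has no opinion.
ABSTAIN, LOW, HIGH = -1, 0, 1

def lf_url(s):
    """URLs embedded in a binary are usually relevant to analysis."""
    return HIGH if "http://" in s or "https://" in s else ABSTAIN

def lf_registry(s):
    """Registry key paths are strong signals of interesting behavior."""
    return HIGH if s.startswith("HKEY_") else ABSTAIN

def lf_padding(s):
    """Strings built from one or two repeated characters are likely junk."""
    return LOW if len(set(s)) <= 2 else ABSTAIN

def majority_label(s, lfs=(lf_url, lf_registry, lf_padding)):
    """Naive conflict resolution: majority vote over non-abstaining LFs."""
    votes = [lf(s) for lf in lfs if lf(s) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

Because labeling functions are cheap to write and apply, this style of weak supervision scales to billions of strings where hand labeling cannot.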

Weakly supervised labels together with transformed string features are then fed into several competing discriminative models with learning-to-rank (LTR) loss functions. LTR has historically been applied to problems like web search and recommendation engines, and we evaluate our models in this same lens by using the mean normalized discounted cumulative gain (nDCG). In our presentation, we will discuss how StringSifter helps us achieve generalizable nDCG scores on holdout Strings datasets, and proceed to demonstrate the tool’s predictions live in action on sample binaries. More broadly, we argue for weak supervision as a promising path forward for other label-starved problems plaguing cybersecurity.
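
The nDCG metric used for evaluation is easy to state concretely. A minimal implementation, assuming graded relevance labels in ranked order:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance grades in ranked order.

    Gains (2^rel - 1) reward highly relevant strings; the log2 discount
    penalizes placing them lower in the ranking.
    """
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalized DCG: 1.0 means the ranking matches the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```

For a ranked list of strings with relevance grades [3, 2, 0, 1], `ndcg` scores how close the model's ordering comes to the ideal [3, 2, 1, 0].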

13:30
Liam Bressler (SparkCognition, United States)
PowerShell Malware Detection using AMSI

ABSTRACT. Machine learning techniques have revolutionized the area of file-based malware detection, as evidenced by some excellent talks delivered in the last few years. However, fileless attacks present a much different problem for these traditional techniques, and there has been a lack of research in this area of rising importance. This talk will propose new approaches to solving this difficult problem. With Windows 10, Microsoft has introduced the Windows Antimalware Scan Interface (AMSI) to its malware-blocking capabilities. In the presenter’s opinion, this service is underutilized among antivirus programs. The interface’s ability to view as well as deobfuscate all manner of scripts (PowerShell, VBScript, etc.) makes it a powerful tool for extracting script code for analysis. However, AMSI does not output the whole script at once, which frustrates current malware detection machine learning approaches. There are ways to come up with a reasonable solution to script detection, however. Scripts (in particular PowerShell) are often easier to parse than executables (in fact, the PowerShell SDK has a Parser class), so there are very clean features for script machine-learning models. Also, each AMSI chunk can be given a “malicious score”; when the score goes over a certain threshold, the script is stopped. Experiments show that this technique has a surprisingly high efficacy, while not falsely alerting too often.
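
The chunk-scoring scheme described above can be sketched as follows. The keyword heuristic is purely hypothetical, standing in for a trained per-chunk model, and the running-max aggregation is one possible design choice among several.

```python
# Hypothetical suspicious tokens; a real system would use a trained model.
SUSPICIOUS = ("downloadstring", "invoke-expression", "frombase64string")

def toy_score(chunk):
    """Toy per-chunk malicious score in [0, 1]."""
    c = chunk.lower()
    return min(1.0, sum(0.4 for kw in SUSPICIOUS if kw in c))

def scan_chunks(chunks, score_fn, threshold=0.8):
    """Score AMSI chunks as they arrive; stop the script once the running
    score crosses the threshold, returning the offending chunk index."""
    running = 0.0
    for i, chunk in enumerate(chunks):
        running = max(running, score_fn(chunk))  # one possible aggregation
        if running >= threshold:
            return ("blocked", i)
    return ("allowed", None)
```

Because AMSI delivers deobfuscated script content chunk by chunk, scoring incrementally lets the scanner block mid-execution rather than waiting for the whole script.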

14:00
Matthew Berninger (FireEye, United States)
TweetSeeker: Extracting Adversary Methods from the Twitterverse

ABSTRACT. Like it or not, Twitter is a useful cybersecurity resource. Every day, cybersecurity practitioners share red team exploits, blue team signatures, malware samples, and many other indicators on Twitter. Users can debate policy issues such as responsible disclosure, intelligence sharing, and nation-state attribution. Connections are made, communities are built, and knowledge is shared.

On the FireEye Advanced Practices Team, our primary mission is to discover and detect advanced adversaries and attack methods. Using Twitter as an intelligence source, we have built an automated framework to help our team focus on actionable cybersecurity information, extracted from the myriad threads and discussions within the “Infosec Twitter” community. This presentation will show the various data science and machine learning methods we are currently using to discover, classify, and present this actionable intelligence to our analysts.

Within this presentation, we will describe how we address two related tasks:

1. Detect and prioritize actionable indicators and warnings for ingest and review by analysts
2. Discover previously unknown sources of intelligence for further collection

We will discuss the various data science concepts that we used for this project, including natural language processing, topic modeling, supervised classification, and graph-based analytics. In addition, we will provide a case study of how our analysts currently use this system to augment our intelligence operations.
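
A first pass at pulling actionable indicators out of tweet text can be sketched with regular expressions. The patterns below are hypothetical and deliberately minimal; a production pipeline also normalizes "defanged" indicators (e.g. hxxp://) and covers many more IOC types.

```python
import re

# Hypothetical IOC patterns; note the defanged-domain form "evil[.]com"
# common in Infosec Twitter posts.
PATTERNS = {
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "defanged_domain": re.compile(r"\b[a-z0-9-]+\[\.\][a-z]{2,}\b"),
}

def extract_indicators(tweet):
    """Return candidate IOCs found in a tweet's text, keyed by type."""
    found = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(tweet)
        if hits:
            found[name] = hits
    return found
```

Extracted candidates would then feed the classification and prioritization stages described above rather than being treated as verdicts on their own.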

We will also describe and demonstrate many of the challenges we have encountered in this research. These include representations of industry-specific terms, Twitter API usage and limitations, dimensionality reduction, and issues related to context. Finally, we will provide lessons learned, next steps, and feedback from front-line analysts using the system.

14:45-16:45 Session 7
14:45
Phil Roth (Endgame, United States)
EMBER Improvements

ABSTRACT. Endgame released an update to the EMBER dataset that includes updated features and a new set of PE files from 2018.

We used a new process for selecting PE files from 2018 to include in the dataset. We were aiming to create a testing set that is more difficult for a machine learning algorithm to classify than the original EMBER 2017 set. We also added steps that eliminated the worst outliers and cut down on duplication in the feature space.

The expanded feature set includes corrections to ordinal import calculations, new features that allow the EMBER classifier to be compared to the Adobe Malware Classifier, and features extracted with an updated version of LIEF. Features were also recalculated for the EMBER 2017 samples and released. This necessitated versioning the feature calculation and sample selection separately.

I’ll talk about the motivations behind all the changes, what research this expansion enables, and the potential dangers in joining the EMBER 2017 and 2018 samples into a single analysis. I’ll also show the results of some of the different classifiers we’ve trained on EMBER 2018 samples.

15:15
Brian Murphy (ReliaQuest, United States)
An Information Security Approach to Feature Engineering

ABSTRACT. Feature engineering in data science is central to obtaining satisfactory results from deep learning models. When considering how to create features for InfoSec purposes, it is important to consider the context of the features and their underlying meaning. Common data science techniques such as feature hashing and one-hot encoding, while effective for certain tasks, often fall short when creating features for security-related models, because locality sensitivity is often lost.

To address this, we built a set of feature encoders and scalers designed specifically for the data types common to information security. In particular, we have found that using security-focused encoders for IP addresses, usernames, URLs, domain names, and geographic information yields dramatically better results than using the naïve encoders commonly employed by data scientists.
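
The locality argument is easy to make concrete for IP addresses. The sketch below is a hypothetical encoder, not the one described in the talk: it maps each octet to a scaled value and appends a private-range flag, so addresses in the same subnet land close together in feature space, which hashing or one-hot encoding destroys.

```python
import ipaddress

def encode_ipv4(ip):
    """Locality-preserving IPv4 encoding: scaled octets plus a private flag."""
    octets = [int(o) / 255.0 for o in ip.split(".")]
    is_private = float(ipaddress.ip_address(ip).is_private)
    return octets + [is_private]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hosts on the same /24 encode close together; a public resolver does not.
near = squared_distance(encode_ipv4("192.168.1.10"), encode_ipv4("192.168.1.11"))
far = squared_distance(encode_ipv4("192.168.1.10"), encode_ipv4("8.8.8.8"))
```

A model can then learn subnet-level structure (e.g. one noisy internal range) instead of treating every address as an unrelated category.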

This talk expands upon the rationale used to arrive at these methods of encoding and goes into detail on the algorithms used to build these new encoders.

The improvement in prediction results from these encoders is clearly seen with a binary classifier trained on labeled data to separate DNS traffic into clean and malicious requests. We see an improvement from approximately 65% accuracy with basic encoders to over 90% with the new security-focused encoders.

Attendees to this presentation will come away with a new approach to encoding InfoSec features for machine learning that should increase the fidelity of their deep learning models.

15:45
Jared Nishikawa (Carbon Black, United States)
Next Generation Process Emulation with Binee

ABSTRACT. The capability to emulate x86 and other architectures has been around for some time. Malware analysts have several tools readily available in the public domain. These tools, while they are able to load PE files and DLLs, do not emulate system calls, and they may crash or behave strangely when they encounter these calls. In this talk we introduce a new tool: Binee (Binary Emulation Environment), a Windows process emulator. Binee creates a nearly identical Windows process memory model inside the emulator, including all dynamically loaded libraries and other Windows process structures. Unlike previous emulators, Binee mimics much of the OS kernel and outputs a detailed description of all function calls with human readable parameters through the duration of the process.  Furthermore, Binee hooks system calls (which may otherwise have been ignored), which allows developers to specify the desired behavior.  For example, if a sample tries to sleep for 60 seconds, Binee hooks the sleep call, increments an emulated clock, and carries on with emulation, effectively skipping the sleep call.  Malware that uses tricks like these can evade dynamic analysis, especially at scale because of the time cost.  For comparison, WINE, though it can be configured to output similar information as Binee, performs a literal translation to system calls, which is not desirable for two reasons: one, it can be used to slow down analysis (as with the sleep example), and two, executing malware in the cloud with WINE would violate cloud providers' terms of service.
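
The sleep-hook idea can be sketched in miniature. Binee itself is a full Windows process emulator (written in Go); the hypothetical Python class below only illustrates the hooking pattern: intercepted API calls are logged with readable parameters, and Sleep advances an emulated clock instead of burning wall-clock time.

```python
class Emulator:
    """Toy sketch of Binee-style API hooking (not Binee's actual design)."""

    def __init__(self):
        self.clock_ms = 0   # emulated clock, advanced by hooked Sleep calls
        self.log = []       # human-readable record of every API call
        self.hooks = {"Sleep": self.hook_sleep}

    def hook_sleep(self, ms):
        # Skip the real delay: advance the emulated clock and carry on,
        # defeating sleep-based sandbox evasion at no wall-clock cost.
        self.clock_ms += ms
        self.log.append(f"Sleep(dwMilliseconds={ms}) -> skipped")
        return 0

    def call(self, name, *args):
        """Dispatch an emulated API call to its hook, logging either way."""
        if name in self.hooks:
            return self.hooks[name](*args)
        self.log.append(f"{name}{args} -> unhooked")
        return None
```

The same dispatch point is where developers can specify arbitrary desired behavior for any hooked call, which is what makes the approach flexible at scale.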

One of the primary benefits Binee provides is data extraction at scale. For the data scientist doing binary analysis, static analysis is the primary source of data in much of the literature. This is due to the relative cost and complexity of static versus dynamic analysis: static analysis is cheap and fast, while dynamic analysis is slow and expensive. Binee offers an increase in useful data extracted but at a cost closer to static analysis tools. Emulation will always take more time than static analysis, but we can cap the time spent on Binee analysis and measure the amount of new data (for example, dynamic imports) extracted.

Currently, our goal is to develop Binee to the point that it will extract useful data for all or most Windows executables, and then use that data to create a much richer feature set than we would get with only static analysis. This would allow us to create, for example, better classifiers for malware.

16:15
Apoorva Joshi (FireEye Inc., United States)
Using Lexical Features for Malicious URL Detection: A Machine Learning Approach

ABSTRACT. Background: Malicious websites are responsible for a majority of the cyber-attacks and scams today. Malicious URLs are delivered to unsuspecting users via email, text messages, pop-ups or advertisements. Clicking on or crawling such URLs can result in compromised email accounts, launching of phishing campaigns, download of malware, spyware and ransomware, as well as severe monetary losses.

Method: A machine learning based ensemble classification approach is proposed to detect malicious URLs in emails, which can be extended to other methods of delivery of malicious URLs. The approach uses static lexical features extracted from the URL string, with the assumption that these features are notably different for malicious and benign URLs. The use of such static features is safer and faster since it does not involve crawling the URLs or lookups which tend to introduce a significant amount of latency in producing verdicts. The dataset consists of a total of ~5 million malicious and benign URLs which were obtained from various sources including online feeds like Openphish, Alexa whitelists and internal FireEye databases. A 50-50 split was maintained between malicious and benign URLs so as to have a good representation of both kinds of URLs in the dataset. Compact feature vector representations were generated for the URLs, consisting of 1000 trigram-based features encoded with MurmurHash and 23 lexical features derived from the URL string. The tools used to generate the feature representations were NLTK (a popular NLP Python package), mmh3 (a MurmurHash Python package) and urllib (a Python library for parsing URLs). The lexical features used for modelling include length of (URL, domain, parameters), number of (dots, delimiters, subdomains, queries) in the URL, presence of suspicious Top Level Domains (TLDs) in the URL, similarity of the domain name to Alexa whitelist domains, to name a few. It was observed that the feature vectors of malicious URL strings so obtained were significantly different from those of benign URL strings. The goal of the classification was to achieve high sensitivity, i.e. detect as many malicious URLs as possible. URL strings tend to be very unstructured and noisy. Hence, bagging algorithms were found to be a good fit for the task since they average out multiple learners trained on different parts of the training data, thus reducing variance. Therefore, Random Forest with Decision Tree estimators was used as the machine learning model of choice for classification.
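
The feature construction can be sketched briefly. The abstract hashes character trigrams with MurmurHash via mmh3; in this standard-library-only sketch, md5 stands in for MurmurHash, and the handful of lexical features shown are a hypothetical subset of the 23 described.

```python
import hashlib

def trigram_features(url, n_buckets=1000):
    """Hashed character-trigram counts (the hashing trick).

    Each trigram is hashed into one of n_buckets slots, giving a
    fixed-width vector regardless of URL length.
    """
    vec = [0] * n_buckets
    u = url.lower()
    for i in range(len(u) - 2):
        bucket = int(hashlib.md5(u[i:i + 3].encode()).hexdigest(), 16) % n_buckets
        vec[bucket] += 1
    return vec

def lexical_features(url):
    """A few illustrative lexical features: lengths and delimiter counts."""
    domain = url.split("//")[-1].split("/")[0]
    return [len(url), len(domain), url.count("."), url.count("-"),
            domain.count(".")]  # dot count in domain ~ number of subdomains

features = trigram_features("http://login-secure.example.com/verify?acct=1")
```

Concatenating the two vectors yields the compact per-URL representation that the ensemble classifier consumes.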

Results: The classification model was tested on five different testing sets, consisting of 200k URLs each. The model produced an average False Negative Rate (FNR) of 0.1%, average accuracy of 92% and average AUC of 0.98. The model is presently being used in the FireEye Advanced URL Detection Engine (used to detect malicious URLs in emails), to generate fast real-time verdicts on URLs. The malicious URL detections from the engine have gone up by 22% since the deployment of the model into the engine workflow.

Conclusion: The results obtained show noteworthy evidence that a purely lexical approach can be used to detect malicious URLs.