Feature Engineering for Malware Detection: Identifying Crucial Static and Dynamic Characteristics from Data to Train Effective Models

EasyChair Preprint 14125

32 pages • Date: July 25, 2024

Abstract

Effective malware detection is crucial in today's increasingly digitized world, where cyber threats pose significant challenges to individuals, organizations, and critical infrastructure. Traditional signature-based detection methods often fall short in identifying novel and polymorphic malware, highlighting the need for more sophisticated approaches. Feature engineering plays a pivotal role in improving the performance of machine learning-based malware detection models by identifying and extracting the most informative characteristics from the data.

This paper presents a comprehensive overview of feature engineering techniques for malware detection, covering both static and dynamic analysis. For static analysis, the study examines file-based features (e.g., file metadata, structure, and content), code-based features (e.g., control flow graphs, call graphs, and static code analysis), and resource-based features (e.g., imported libraries, embedded resources). For dynamic analysis, the focus is on behavioral features, such as system call traces, API call traces, and network traffic analysis, as well as memory-based and sandbox-based features.
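As an illustration of the file-based static features mentioned above, the following minimal Python sketch computes three simple content statistics from a file's raw bytes. It is not taken from the paper: the function names (`byte_entropy`, `static_file_features`) and the choice of statistics (size, byte entropy, printable-character ratio) are assumptions made for this example only.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_file_features(data: bytes) -> dict:
    """Hypothetical feature set: a few content statistics of one file."""
    printable = sum(32 <= b < 127 for b in data)  # printable ASCII bytes
    return {
        "size_bytes": len(data),
        "entropy_bits_per_byte": byte_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }


if __name__ == "__main__":
    sample = b"MZ\x00\x00"  # illustrative bytes, not a real executable
    print(static_file_features(sample))
```

High byte entropy is often associated with packed or encrypted content, which is one reason a statistic of this kind may be useful as a static feature; on its own it does not indicate malice.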

The paper further discusses feature selection and extraction techniques, including correlation analysis, information gain, principal component analysis, and recursive feature elimination, to identify the characteristics that matter most for model training. It also explores feature representation and encoding methods, such as numeric encoding, one-hot encoding, word embedding, and sequence-to-sequence encoding, which convert the selected features into a form suitable as model input.
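Two of the techniques named above, information gain and one-hot encoding, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`entropy`, `information_gain`, `one_hot`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the entropy remaining after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def one_hot(values):
    """Map categorical values to 0/1 vectors over a sorted vocabulary."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = [[1 if index[v] == i else 0 for i in range(len(vocab))]
            for v in values]
    return vocab, rows


if __name__ == "__main__":
    labels = [1, 1, 0, 0]            # toy labels: 1 = malicious, 0 = benign
    calls_suspicious_api = [1, 1, 0, 0]  # hypothetical binary feature
    print(information_gain(calls_suspicious_api, labels))
    print(one_hot(["exe", "dll", "exe"]))
```

A feature that splits the samples into pure groups receives the full label entropy as its gain, while a feature that leaves each group as mixed as the whole set receives zero, so ranking features by this score is one simple way to select them.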

Keyphrases: API call traces, engineering techniques, network traffic analysis, polymorphic malware

Links:

https://easychair.org/publications/preprint/JvgM

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:14125,
  author    = {John Owen},
  title     = {Feature Engineering for Malware Detection: Identifying Crucial Static and Dynamic Characteristics from Data to Train Effective Models},
  howpublished = {EasyChair Preprint 14125},
  year      = {EasyChair, 2024}}
