Tags: API Sequence, Deep Learning, Fine-Tuned Language Model, Malware Detection
Abstract:
Application Programming Interfaces (APIs) continue to be the primary and most accessible data source for malware detection and classification methods. With recent Deep Learning (DL) breakthroughs, techniques for analysing API call sequences have become increasingly effective at extracting valuable insights. However, the length and complexity of these sequences can pose challenges, making them difficult to interpret and analyse comprehensively. Furthermore, traditional DL models may struggle to capture the long-range dependencies and sequential patterns in such extended API call sequences, which are essential for accurate malware detection. This paper proposes a novel malware detection approach based on API call sequences that leverages Bidirectional Encoder Representations from Transformers (BERT) together with a DL model combining Convolutional Neural Network (CNN) and Extended Long Short-Term Memory (xLSTM) techniques. BERT effectively captures the contextual relationships between API calls, while the CNN-xLSTM model proves highly effective at classifying sequences by preserving long-term dependencies and handling the complexities of sequential data. Experimental results on the EMBER dataset show that our approach outperforms existing state-of-the-art embedding and detection methods in both accuracy and robustness.
An Approach to Fine-Tuning Language Models and Handling Long Sequences for Efficient API Call Analysis in Uncovering Windows Malware
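As a rough illustration of the pipeline the abstract describes, the sketch below pairs a pretrained BERT encoder with a 1-D CNN and a recurrent layer over API-call token embeddings. It is a minimal sketch under several assumptions: the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are tooling choices made here for illustration, a plain bidirectional LSTM stands in for xLSTM (which is not available in stock PyTorch), and all layer sizes are arbitrary rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class BertCnnLstmDetector(nn.Module):
    """Sketch of a BERT -> CNN -> recurrent classifier over API call sequences."""

    def __init__(self, bert_name: str = "bert-base-uncased", n_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)              # contextual API-call embeddings
        hidden = self.bert.config.hidden_size                         # 768 for bert-base
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)  # local call n-gram features
        # A plain bidirectional LSTM stands in for xLSTM here (assumption).
        self.rnn = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)                        # benign vs. malicious logits

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        emb = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state  # (B, T, 768)
        feats = torch.relu(self.conv(emb.transpose(1, 2)))            # (B, 128, T)
        out, _ = self.rnn(feats.transpose(1, 2))                      # (B, T, 128)
        return self.fc(out[:, -1, :])                                 # classify from the last step


# Illustrative usage: an API call sequence rendered as a whitespace-joined string.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["NtCreateFile NtWriteFile RegSetValueExW CreateRemoteThread"],
                  padding=True, truncation=True, max_length=512, return_tensors="pt")
model = BertCnnLstmDetector()
logits = model(batch["input_ids"], batch["attention_mask"])           # shape (1, 2)
```

In a setup like this, the BERT encoder would typically be fine-tuned jointly with the downstream layers so that the embeddings adapt to API-call vocabulary; how the paper handles that, and how sequences longer than the encoder's input limit are split or pooled, is not specified in the abstract.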