MalFinder: an Ensemble Learning-Based Framework for Malicious Traffic Detection

Title:MalFinder: an Ensemble Learning-Based Framework for Malicious Traffic Detection

Authors:Candong Rong, Gaopeng Gou, Mingxin Cui, Gang Xiong, Zhen Li and Li Guo

Conference:IEEE ISCC 2020

Tags:ensemble learning, malicious traffic detection, unseen threat discovery

Abstract:

Malicious events pose a significant threat to today's increasingly interconnected Internet community. A common approach to identifying them combines features of network traffic with machine learning algorithms, and the performance of such approaches depends on the features and algorithms used. In this paper, we propose MalFinder, an ensemble learning-based framework for malicious traffic detection. Given the trend toward network traffic encryption and the complexity of decrypting traffic, we describe network traffic using statistical features and sequence features, and we extend the dimensions of both feature types to enhance their ability to represent traffic data. Feature importance analysis and contrast experiments illustrate the effectiveness of the new features. Among the classifiers we select as suitable for malicious traffic detection, the boosting-based classifiers XGBoost and LightGBM can reduce bias, and the bagging-based classifier Random Forest can reduce variance. Stacking, the method our framework uses to integrate the classification results, can improve the generalization ability of the method. On a real-world dataset, MalFinder achieves 96.58% F-measure and 95.44% accuracy in the malicious traffic detection task, better than the comparison methods. In unseen malicious traffic discovery, MalFinder still performs well, with 93.46% F-measure and 91.04% accuracy, which even surpasses the results of the comparison methods on known malicious traffic detection. Given the scarcity of public datasets for malicious traffic detection, we have made our self-built dataset publicly available to support more extensive research.
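The stacking idea in the abstract (boosting-based and bagging-based base classifiers whose outputs are combined by a second-level model) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn, uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and LightGBM, assumes a logistic regression meta-learner (the abstract does not name one), and runs on synthetic data rather than the paper's dataset, so the scores it prints are unrelated to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-flow feature vectors (assumption: not the
# statistical/sequence features or the dataset used in the paper).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners: one boosting-based model (stand-in for XGBoost/LightGBM)
# and one bagging-based model (Random Forest).
base_learners = [
    ("boosting", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("bagging", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Stacking: the meta-learner is trained on out-of-fold predictions of the
# base learners (5-fold cross-validation). The meta-learner is an assumption.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f_measure = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.4f} F-measure={f_measure:.4f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what keeps the second level from simply memorizing overfit base outputs.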