Tags: HTTP Traffic, Log Analysis, Malware Detection, NetFlow, Semi-supervised Learning
Abstract:
Machine learning is becoming a key component in automatically detecting malware-infected hosts from network logs in a security operations center (SOC). However, machine learning usually requires a large amount of labeled training data, which is difficult to acquire because labels are assigned manually by professional security analysts. Meanwhile, daily operation accumulates abundant logs that are stored but never analyzed. This paper proposes SILU, a novel semi-supervised learning method that fully leverages unlabeled data and enhances detection capability without increasing the amount of manually labeled data. SILU learns from combined labeled and unlabeled training data to automatically augment the labeled training data, and then generates a classifier through a screening process. Most semi-supervised learning methods used in cyber security use the test data as unlabeled training data. SILU, in contrast, can use different datasets for unlabeled training data and test data, so it does not require retraining when the test data change. This keeps detection time low in practical SOC operation. In addition, although SILU includes a supervised learning step, it is not tied to a specific supervised learning method. Moreover, the screening process lets SILU suppress deterioration of classification performance on test data. We evaluated SILU using two types of real-world logs: proxy logs from a large enterprise and NetFlow from a large ISP. Evaluating with different types of classifiers, we demonstrated that SILU always improves the detection capability of supervised learning methods. SILU also outperforms current semi-supervised methods. Our evaluation also shows that using NetFlow from the ISP as unlabeled training data works better than using only labeled proxy logs from the same enterprise. These results suggest that SILU can extend detection capability further when different organizations, e.g., SOCs and ISPs, collaborate and share unlabeled data.
SILU: Strategy Involving Large-scale Unlabeled Logs for Improving Malware Detector
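
The abstract describes augmenting labeled training data from unlabeled logs and then screening the resulting classifier. As a rough illustration of that general idea only, the sketch below shows generic self-training (pseudo-labeling) with a screening check. It is not SILU's actual algorithm, which the abstract does not specify. The nearest-centroid classifier, the margin threshold `min_margin`, the held-out `validation` set, the function names, and the toy feature vectors are all assumptions introduced for this example.

```python
from statistics import mean

def fit_centroids(samples):
    """Fit a nearest-centroid model from (features, label) pairs."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: tuple(mean(col) for col in zip(*xs)) for y, xs in by_label.items()}

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict(centroids, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    ranked = sorted((sq_dist(x, c), y) for y, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def accuracy(centroids, samples):
    return mean(1.0 if predict(centroids, x)[0] == y else 0.0 for x, y in samples)

def self_train_with_screening(labeled, unlabeled, validation, min_margin):
    """Hypothetical helper: augment labeled data with confident pseudo-labels,
    then keep the augmented model only if it is no worse on held-out data."""
    base = fit_centroids(labeled)
    pseudo = []
    for x in unlabeled:
        label, margin = predict(base, x)
        if margin >= min_margin:  # accept confident predictions only
            pseudo.append((x, label))
    augmented = fit_centroids(labeled + pseudo)
    # Screening (assumed form): reject the augmented model if accuracy drops.
    if accuracy(augmented, validation) >= accuracy(base, validation):
        return augmented, len(pseudo)
    return base, 0

# Toy two-dimensional features standing in for per-host log statistics.
labeled = [((0.0, 0.0), "benign"), ((1.0, 0.0), "benign"),
           ((5.0, 5.0), "malicious"), ((6.0, 5.0), "malicious")]
unlabeled = [(0.5, 0.5), (5.5, 5.5), (3.0, 2.5)]
validation = [((0.2, 0.1), "benign"), ((5.8, 5.2), "malicious")]

model, n_added = self_train_with_screening(labeled, unlabeled, validation, min_margin=10.0)
```

In this toy run, the ambiguous point `(3.0, 2.5)` is equidistant from both centroids, so it is left out of the augmented training set. The unlabeled pool is separate from whatever data the model is later applied to, which mirrors the abstract's point that test data can change without retraining.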