Tags: categorical variable, class precision recall, classifier achieved high accuracy, Cyberthreats modeling, Data classification, decision tree, high accuracy, machine learning, malicious url, malicious url detection technique, Malicious URL links, malicious web page, missing value, phishing detection, random forest, random forest classifier, testing data, threat detection system, url link
Abstract:
Cyber threat behaviors can take different forms, approaches, and goals. For threat detection systems, it is essential to monitor URLs known for previous malicious attempts. It is also vital to study attack behaviors, with the ultimate goal of designing autonomous threat detection systems. Toward that goal, we collected a large dataset of URL links manually annotated as benign or malicious. Several lexical and structural features were collected for those links. Two classification algorithms were employed to extract the best features for predicting whether a link is malicious or benign. We applied three preprocessing techniques: handling missing values, dealing with outliers, and encoding categorical variables. For both classifiers, we used proper data-splitting methods to avoid biased results, and we evaluated classifier quality by accuracy. The results indicated that Random Forest consistently achieved better accuracy than the Decision Tree classifier, predicting whether a URL link is malicious or not with a highest accuracy of 98%.
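As a rough illustration of what lexical and structural URL features can look like, the following minimal Python sketch derives a few of them using only the standard library. The function name `lexical_features` and every feature in it are hypothetical choices made for illustration; the abstract does not list the features actually used in the study.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Return a few illustrative lexical/structural features for one URL.

    The feature set is an assumption for demonstration purposes, not the
    feature set used in the study.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Total number of characters in the URL string.
        "url_length": len(url),
        # Number of characters in the host name.
        "host_length": len(host),
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        # Count of digit characters anywhere in the URL.
        "num_digits": sum(ch.isdigit() for ch in url),
        # Crude subdomain count: dots in the host minus one (not valid for IP hosts).
        "num_subdomains": max(host.count(".") - 1, 0),
        # Whether the URL contains an "@" character.
        "has_at_symbol": "@" in url,
        # Whether the scheme is HTTPS.
        "uses_https": parsed.scheme == "https",
    }


if __name__ == "__main__":
    print(lexical_features("http://login.secure.example.com/verify?id=123"))
```

Feature dictionaries of this kind could then be assembled into a table and passed to a Decision Tree or Random Forest classifier after the preprocessing steps mentioned above.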
URL Links Malicious Classification Towards Autonomous Threat Detection Systems