Tags: Classification, Explainability, Exploratory Data Analysis, Lifestyle, Lung Cancer, Machine Learning
Abstract:
Lung cancer is the leading cause of cancer mortality worldwide, so alternative methods for its early identification deserve serious consideration, especially in low-resource settings. This work presents a machine learning-based alternative for lung cancer risk recognition. Using lifestyle questionnaire data, we collected, analyzed, visualized, and processed the data to train machine learning models. After balancing the data with SMOTE, we fine-tuned our proposed model, XGBoost, using GridSearchCV, with 70% of the data allocated to training and tuning and the remainder held out for testing. On the first dataset, the model achieved an accuracy and F1 score of 96.50% and a precision and sensitivity of 96.51%. On the second dataset, it achieved an accuracy of 95.83%, an F1 score of 95.83%, a precision of 96.27%, and a sensitivity of 95.83%. Additionally, LIME provided local interpretability of individual predictions, contributing to a more transparent understanding of model behavior and a clearer perspective on machine learning-based solutions.
Prediction of Lung Cancer Risk Through Machine Learning Based on Lifestyle Questionnaire Data
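
The abstract mentions balancing the data with SMOTE before training. As a rough illustration of the idea only, the sketch below generates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The abstract does not say which implementation the authors used, so the function name `smote_oversample`, its parameters, and the toy data are all hypothetical and are not the paper's code or dataset.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic rows built by interpolating between minority
    samples and their k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base row.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): 6 majority rows, 2 minority rows, two numeric features.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.8]]

# Generate enough synthetic rows for the minority class to match the majority.
new_rows = smote_oversample(minority, n_new=len(majority) - len(minority), k=1)
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

After the call, both classes contain the same number of rows, and every synthetic row lies on the line segment between two real minority rows. In practice, oversampling of this kind is usually applied to the training portion only, so that the held-out test data contains no synthetic rows; whether the authors did so is not stated in the abstract.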