Tags: COVID-19, Kaggle Datasets, Machine Learning, Models Evaluation and Statistics
Abstract:
This study investigates statistical indicators and evaluation metrics for five shallow machine learning models, including one gradient boosting method, trained on two COVID-19 datasets. The goal is to identify the best-performing model for tracking the progression of COVID-19 statistics. The selected models are K-Nearest Neighbors, Decision Tree (DT), Support Vector Machine, Classification and Regression Trees (CART), and Extreme Gradient Boost (XGBoost). The first dataset, covering the Oceania countries and selected from Kaggle, had 10 observations and was augmented to 110, using the cumulative sum of the values, to provide enough training data. The best-performing models for this dataset are CART and DT, with accuracy, sensitivity, specificity, precision, and F1-score all above 93.2%. For DT, the mean squared error is between 0.3 and 3.5, and the mean absolute error is less than 1.7. The coefficient of determination is approximately 0.8 for both CART and DT. The correlation coefficient is at least 0.81 for DT and at least 0.83 for CART. For the second dataset, covering the last 100 countries on Worldometer (total cases until September 2020), the CART and DT models achieve accuracy and specificity above 88%, with precision and F1-score above 78%; sensitivity for DT is around 84%. The graphs and charts obtained for the DT model show mean squared and mean absolute errors below 2. XGBoost is the third-best model. Some models can help identify geographic regions at risk of infection. Overall, CART and DT are the best-performing models for both selected datasets.
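For readers unfamiliar with the metrics named above, the following is a minimal sketch of their standard textbook definitions, assuming a binary classification setting. It is not the study's own code: the function names `classification_metrics` and `regression_errors` and all input values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: standard definitions of the metrics named in the
# abstract. Function names and the sample inputs are hypothetical and are not
# taken from the study.

def classification_metrics(tp, fp, tn, fn):
    """Compute five metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (each class is present and at
    least one positive prediction was made).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall, or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def regression_errors(y_true, y_pred):
    """Compute MSE, MAE and the coefficient of determination (R^2).

    Assumes y_true is not constant, so the total sum of squares is non-zero.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(r * r for r in residuals)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}


if __name__ == "__main__":
    # Hypothetical counts and values, for demonstration only.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
    print(regression_errors([1, 2, 3], [1, 2, 5]))
```

Sensitivity and specificity are reported separately because a model can score well on one while doing poorly on the other, which accuracy alone would hide on an imbalanced dataset.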
Machine Learning Predictive Models Applied on COVID-19 Datasets to Estimate Infection Risk in Specific Geographic Regions