Tags: COVID-19, Latent Dirichlet Allocation, Topic Modeling and Word Embeddings
Abstract:
The outbreak of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has shaken the world, causing a global crisis in a completely unexpected way not seen in years. Its rapid spread and severity have prompted scientists all over the world to investigate its causes, symptoms, treatments, and effects, resulting in a huge number of publications and articles in just a few months. This overwhelming amount of information complicates access to relevant investigations and facilitates the inclusion of non-relevant studies that can delay critical activities. Our goal is to determine the best way to categorize documents, identifying those most relevant to different groups, such as policy-makers or the biomedical community, so that they can advance their investigations and overcome information overload. We propose five classes for a predefined COVID-related corpus (CORD-19), demonstrating that some of the included articles have no connection with the subject and that the relevance of each paper is highly dependent on the specific area of study. Promising results were obtained using a simple model that combines word embeddings, topic modeling, and a Support Vector Classifier.
Categorization of CORD-19 Articles Using Word Embeddings and Topic Models
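
The abstract names three components (word embeddings, topic modeling, and a Support Vector Classifier) without specifying how they are combined. The following is a minimal sketch of one plausible arrangement, assuming scikit-learn: per-document topic proportions from Latent Dirichlet Allocation are concatenated with an averaged word-vector representation and passed to an SVC. The toy corpus, the two labels, the helper `build_features`, and the use of truncated SVD as a stand-in for trained word embeddings are all illustrative assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy corpus and labels: purely illustrative, not taken from CORD-19.
docs = [
    "vaccine trial shows immune response in patients",
    "antiviral treatment reduces symptoms in infected patients",
    "clinical trial of antiviral drug in hospital patients",
    "lockdown policy reduces transmission in cities",
    "government policy on travel restrictions and lockdown",
    "school closure policy and public transmission rates",
]
labels = ["biomedical", "biomedical", "biomedical", "policy", "policy", "policy"]


def build_features(texts, n_topics=2, embedding_dim=3, seed=0):
    """Hypothetical helper: concatenate embedding and topic features."""
    # Bag-of-words counts, shape (n_docs, vocab_size).
    counts = CountVectorizer().fit_transform(texts)

    # Topic features: per-document topic proportions from LDA.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topic_features = lda.fit_transform(counts)

    # Stand-in word embeddings (assumption): low-rank vectors from an SVD of
    # the term-document matrix. A real system would more likely load
    # pretrained or corpus-trained embeddings instead.
    svd = TruncatedSVD(n_components=embedding_dim, random_state=seed)
    word_vectors = svd.fit_transform(counts.T)  # (vocab_size, embedding_dim)

    # Document embedding: count-weighted average of its word vectors.
    doc_lengths = np.asarray(counts.sum(axis=1)).ravel()
    doc_lengths[doc_lengths == 0] = 1  # guard against empty documents
    embedding_features = (counts @ word_vectors) / doc_lengths[:, None]

    return np.hstack([embedding_features, topic_features])


features = build_features(docs)

# Support Vector Classifier trained on the combined features.
classifier = SVC(kernel="linear")
classifier.fit(features, labels)
predictions = classifier.predict(features)
```

Here the features are fitted and evaluated on the same six documents only to keep the sketch short; a real evaluation would fit the vectorizer, topic model, and embeddings on training data and score the classifier on held-out articles.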