Categorization of CORD-19 Articles Using Word Embeddings and Topic Models

EasyChair Preprint 4530

10 pages • Date: November 7, 2020

José Antonio Espinosa-Melchor and Jerónimo Arenas-García

Abstract

The outbreak of coronavirus disease 2019 (COVID-19), the disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has shaken the world, causing a global crisis in a completely unexpected way not seen in years. Its rapid spread and severity have prompted scientists all over the world to investigate its causes, symptoms, treatments and effects, resulting in a huge number of publications and articles in just a few months. This overwhelming amount of information complicates access to relevant research and facilitates the inclusion of non-relevant studies that can delay critical activities. Our goal is to determine the best way to categorize documents, identifying those most relevant to different groups, such as policy-makers or the biomedical community, so that they can advance their investigations and overcome information overload. We propose five classes for a predefined COVID-related corpus (CORD-19), demonstrating that some of the articles included have no connection with the subject, and that the relevance of each paper is highly dependent on the specific area of study. Promising results were obtained using a simple model that combines word embeddings, topic modeling, and a Support Vector Classifier.
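
As a rough illustration of the kind of model the abstract describes, the sketch below concatenates topic proportions from Latent Dirichlet Allocation with dense document vectors and feeds them to a Support Vector Classifier, using scikit-learn. It is not the authors' implementation. The toy documents, the two class labels, the number of topics, and the linear kernel are all assumptions made for this example, and the dense vectors come from a truncated SVD of TF-IDF counts, used here only as a stand-in for the word embeddings of the preprint.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.svm import SVC

# Toy corpus and labels, invented for illustration (not taken from CORD-19).
docs = [
    "vaccine trial shows antibody response in patients",
    "antiviral treatment reduces viral load in infected patients",
    "clinical symptoms and treatment of respiratory infection",
    "lockdown policy reduces transmission in large cities",
    "school closures and travel restrictions slow the spread",
    "government policy on masks and social distancing",
]
labels = ["biomedical", "biomedical", "biomedical", "policy", "policy", "policy"]

# Branch 1: document-topic proportions from LDA (two topics, an assumption).
topic_branch = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)

# Branch 2: dense document vectors. Truncated SVD of TF-IDF is a stand-in
# here; the preprint uses word embeddings instead.
dense_branch = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Concatenate both feature sets and classify with a linear SVC.
features = FeatureUnion([("topics", topic_branch), ("dense", dense_branch)])
model = Pipeline([("features", features), ("svc", SVC(kernel="linear"))])

model.fit(docs, labels)
predictions = model.predict(docs)
feature_matrix = model.named_steps["features"].transform(docs)
```

Each document ends up as a four-dimensional vector (two topic proportions plus two dense components). In practice the feature sizes, the embedding model, and the classifier settings would need to be tuned on held-out data.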

Keyphrases: COVID-19, Latent Dirichlet Allocation, topic modeling, word embeddings

Links:

https://easychair.org/publications/preprint/BDvN

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:4530,
  author    = {José Antonio Espinosa-Melchor and Jerónimo Arenas-García},
  title     = {Categorization of CORD-19 Articles Using Word Embeddings and Topic Models},
  howpublished = {EasyChair Preprint 4530},
  year      = {EasyChair, 2020}}
