Validation of Scientific Topic Models Using Graph Analysis and Corpus Metadata

Tags: Graph Analysis, Latent Dirichlet Allocation, Model Validation, Semantic Similarity, STI Topic Analysis, Topic Modeling
Abstract:
Latent Dirichlet Allocation (LDA) has become a cornerstone of probabilistic topic modeling for text collections. In the context of science analysis and of scientific policy design and monitoring, these tools have been widely used over the last few years to model topic evolution, detect emerging topics, or analyze lead-lag relationships between datasets. Many datasets related to Science, Technology and Innovation (STI), such as scientific articles, patent applications, and funding proposals, have been analyzed with these tools, and examples can be found in recent editions of the Global Tech Mining Conference. Apart from offering a thematic overview of a document collection, topic models can also provide an intermediate representation that can later be used to train automatic classifiers (with respect to specific taxonomies) or to build semantic graphs. However, topic models depend on a set of algorithm hyperparameters, including in many cases the number of topics, and their selection may significantly affect the quality of the resulting models. Common procedures for selecting them rely on coherence measures and subjective evaluation. In this work, we propose to exploit document graphs built from available metadata for hyperparameter selection, and we compare this strategy with topic coherence and topic model stability approaches. Our results on several STI-related datasets show that these strategies provide relevant indicators for building high-quality topic models.
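The abstract mentions two ingredients that are straightforward to prototype: selecting the number of LDA topics with a coherence measure, and checking a fitted model against a document graph derived from corpus metadata. The sketch below, using gensim, is illustrative only: the toy corpus, the metadata edge list, and the linked-vs-random similarity gap used as a graph-based indicator are assumptions for demonstration, not the metric proposed in the paper.

```python
# Minimal sketch (illustrative, not the paper's exact method): select the number of
# LDA topics by c_v coherence, then check each model against a metadata-based
# document graph by comparing topic similarity of linked vs. random document pairs.
import random

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy tokenized corpus; in practice these would be preprocessed abstracts, claims, etc.
texts = [
    ["graph", "network", "node", "edge", "community"],
    ["graph", "edge", "clustering", "community", "modularity"],
    ["topic", "model", "lda", "dirichlet", "corpus"],
    ["topic", "corpus", "word", "distribution", "lda"],
    ["patent", "innovation", "citation", "applicant", "claim"],
    ["patent", "claim", "citation", "technology", "innovation"],
] * 5  # repeat so the toy example has enough documents to fit a model

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Hypothetical metadata graph: pairs of documents linked by shared metadata
# (e.g. common authors or citations). Here we simply link documents of the same
# toy theme to stand in for such links.
edges = [
    (i, j)
    for i in range(len(texts))
    for j in range(i + 1, len(texts))
    if (i % 6) // 2 == (j % 6) // 2
]


def doc_topic_matrix(model, corpus, num_topics):
    """Dense document-topic matrix from a fitted LDA model."""
    theta = np.zeros((len(corpus), num_topics))
    for i, bow in enumerate(corpus):
        for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
            theta[i, topic_id] = prob
    return theta


def graph_agreement(theta, edges, n_random=200, seed=0):
    """Mean cosine similarity of metadata-linked pairs minus that of random pairs
    (one simple graph-based indicator; the paper may use a different measure)."""
    rng = random.Random(seed)
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    linked = np.mean([cos(theta[i], theta[j]) for i, j in edges])
    rand = np.mean([
        cos(theta[rng.randrange(len(theta))], theta[rng.randrange(len(theta))])
        for _ in range(n_random)
    ])
    return linked - rand


for k in (2, 3, 5, 8):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    theta = doc_topic_matrix(lda, corpus, k)
    print(f"k={k}: c_v coherence={coherence:.3f}, "
          f"graph agreement={graph_agreement(theta, edges):.3f}")
```

On real STI corpora, the edge list would come from corpus metadata such as shared authors, citations, or funding calls, and both indicators would typically be averaged over several random seeds, since LDA's run-to-run variability is precisely what the stability-based baseline in the abstract addresses.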