Text Clustering for Topic Identification: a TF-IDF and K-Means Approach Applied to the 20 Newsgroups Dataset

EasyChair Preprint 13917

5 pages • Date: July 10, 2024

Abstract

In this paper, we present an efficient approach for topic modeling that combines Term Frequency-Inverse Document Frequency (TF-IDF) weighting with K-means clustering, applied to the 20 Newsgroups dataset, a well-known collection of approximately 20,000 newsgroup documents partitioned across 20 newsgroups. The method preprocesses the text to remove noise, computes a TF-IDF matrix that represents the documents in a high-dimensional space, and applies K-means clustering to group the documents into distinct topics. We demonstrate its effectiveness by identifying coherent topic clusters and highlighting the key terms associated with each cluster. This straightforward combination of TF-IDF and K-means offers a robust solution for text clustering and topic identification, making it suitable for a range of natural language processing applications. The results show that the method can uncover the underlying topics within a large text corpus, providing insights for further text analysis and information retrieval.
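The pipeline described in the abstract (preprocess, build a TF-IDF matrix, cluster with K-means, read off the key terms per cluster) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code. The four-document corpus, the stop-word list, the smoothed IDF variant, the farthest-first seeding, and all function names below are assumptions chosen for illustration. On the real dataset one would more likely use a library such as scikit-learn (`fetch_20newsgroups`, `TfidfVectorizer`, `KMeans`).

```python
import math
import re
from collections import Counter

# Tiny stand-in corpus (an assumption for illustration; not the 20 Newsgroups data).
DOCS = [
    "The rocket launch reached orbit around the moon",
    "NASA plans a new rocket launch to orbit Mars",
    "The hockey team won the game in overtime",
    "The goalie saved the hockey game for the team",
]
STOP_WORDS = {"the", "a", "an", "in", "to", "for", "of", "and"}  # illustrative list


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(token_lists):
    """Return (vocabulary, rows), where each row is an L2-normalised TF-IDF vector."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed IDF; one common variant among several.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for tokens in token_lists:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-first seeding (an assumption)."""
    centroids = [rows[0][:]]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(farthest[:])
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Terms with the largest centroid weights, used as a cluster's key terms."""
    ranked = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


tokens = [preprocess(d) for d in DOCS]
vocab, X = tfidf_matrix(tokens)
labels, centroids = kmeans(X, k=2)
cluster_terms = {j: top_terms(vocab, c) for j, c in enumerate(centroids)}
for j, terms in cluster_terms.items():
    print(j, terms)
```

On this toy corpus the two space documents and the two hockey documents fall into separate clusters, and the highest-weighted centroid terms act as topic labels. With 20 newsgroups the same steps would use k = 20 and sparse matrices rather than dense Python lists.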

Keyphrases: 20 Newsgroups, Clusters, K-means, Natural Language Processing, Term Frequency-Inverse Document Frequency

Links:

https://easychair.org/publications/preprint/C6Ft

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:13917,
  author    = {Marppan Sampath and S Vignesh},
  title     = {Text Clustering for Topic Identification: a TF-IDF and K-Means Approach Applied to the 20 Newsgroups Dataset},
  howpublished = {EasyChair Preprint 13917},
  year      = {EasyChair, 2024}}
