Tags: Document clustering, K-means, PCA, Sentence embedding, TF-IDF
Abstract:
Document clustering is the task of organizing textual content into groups so that documents within a group are more similar to one another than to those in other groups. Several text clustering algorithms have been proposed in recent years. However, most of this work is limited to English-language documents. Odia is the language spoken by the people of Odisha, and its presence on digital platforms has been growing. This paper proposes an optimized, PCA-based feature representation of Odia documents for efficient document clustering. The proposed work first extracts four different features from Odia sentences: word-level TF-IDF, character-level TF-IDF, word-embedding, and sentence-embedding vectors. PCA is then applied to optimize these feature representations before clustering. With a Silhouette Coefficient of 0.964, Rand Index of 0.352, Normalized Mutual Information score of 0.001, and Davies-Bouldin Index of 0.022, the PCA-based optimized word-level TF-IDF features were found to perform better than the other feature representations.
Optimized Feature Representation for Odia Document Clustering
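The pipeline summarized in the abstract (TF-IDF features, PCA reduction, K-means clustering) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names (`tfidf_matrix`, `pca`, `kmeans`) are hypothetical, the toy English tokens stand in for tokenized Odia text, the smoothed IDF formula is one common variant, PCA is approximated by power iteration, and K-means uses a deterministic farthest-point initialization.

```python
import math
import random
from collections import Counter

def word_tokens(text):
    return text.split()

def char_ngrams(text, n=3):
    # Character-level tokens: overlapping n-grams over the raw string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_matrix(docs, tokenize):
    """L2-normalized TF-IDF rows using a smoothed IDF (an assumed variant)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows, vocab

def pca(rows, n_components=2, n_iter=100, seed=0):
    """Project centered rows onto top components found by power iteration."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(n_iter):
            xv = [sum(a * b for a, b in zip(r, v)) for r in X]
            w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(d)]
            for c in components:  # deflate: remove directions already found
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0:
                break
            v = [a / norm for a in w]
        components.append(v)
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in X]

def kmeans(points, k=2, n_iter=50):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append([sum(c) / len(members) for c in zip(*members)]
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return labels

# Toy placeholder corpus: two topics with disjoint vocabularies.
docs = [
    "cricket team match score wins",
    "cricket team match score bat",
    "cricket team match score ball",
    "monsoon rain flood river village",
    "monsoon rain flood river crops",
    "monsoon rain flood river roads",
]

word_rows, word_vocab = tfidf_matrix(docs, word_tokens)
char_rows, char_vocab = tfidf_matrix(docs, char_ngrams)  # character-level variant
reduced = pca(word_rows, n_components=2)
labels = kmeans(reduced, k=2)
print(labels)
```

The same `pca` and `kmeans` calls can be applied to `char_rows` to compare feature representations. In practice a numerical library would replace these hand-written routines, and the clustering would be scored with the metrics named in the abstract.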