Tags: Clustering Categorical Data, Decomposable Dissimilarities, K-Modes and Triangle Inequality
Abstract:
Clustering is an unsupervised machine learning task that aims to discover natural groups in a given dataset. K-Modes algorithms, which adapt K-means clustering from continuous to categorical data, are among the most popular algorithms for discovering clusters in categorical data. In this paper, we present first results on how to accelerate them using the triangle inequality while always computing exactly the same result as the original K-Modes. We also provide empirical evidence illustrating the potential gains from leveraging the triangle inequality. Finally, we outline future work aimed at a comprehensive understanding of how the triangle inequality can be used to accelerate (other) clustering algorithms for categorical data.