Tags: clustering, CUDA, GPU, high-dimensional, Mahalanobis distance, parallel
Abstract:
Hierarchical clustering algorithms are common tools for simplifying, exploring, and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on the Mahalanobis distance to produce results better suited to the domain. However, the wide applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which prevents it from scaling to common cytometry datasets. This paper proposes an optimized GPU-accelerated version of Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by over two orders of magnitude and thus allows it to scale to much larger datasets. It also provides a detailed analysis of the optimizations used and of the experimental results, which may be useful for other hierarchical clustering problems. We have performed benchmarks on publicly available high-dimensional flow cytometry data, which demonstrate the applicability of our implementation in the target domain.