Subspace Clustering of Very Sparse High-Dimensional Data

Hankui Peng, Nicos G. Pavlidis, Idris A. Eckley, Ioannis Tsalamanis

2019

In this paper we consider the problem of clustering collections of very short texts using subspace clustering. This problem arises in many applications such as product categorisation, fraud detection, and sentiment analysis. The main challenge lies in the fact that the vectorial representation of short texts is both high-dimensional, due to the large number of unique terms in the corpus, and extremely sparse, as each text contains a very small number of words with no repetition. We propose a new, simple subspace clustering algorithm that relies on linear algebra to cluster such datasets. Experimental results on identifying product categories from product names obtained from the US Amazon website indicate that the algorithm can be competitive against state-of-the-art clustering algorithms.
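To make the representation described above concrete, the following is a minimal sketch of a binary bag-of-words encoding, in which each short text becomes a row with one column per unique corpus term and a nonzero entry only where a term is present. It is not the algorithm proposed in the paper. The function names `binary_bag_of_words` and `density`, the whitespace tokenisation, and the example product names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def binary_bag_of_words(texts: List[str]) -> Tuple[Dict[str, int], List[List[int]]]:
    """Encode texts as sparse binary rows.

    Returns the vocabulary (term -> column index) and, for each text,
    the sorted column indices of the terms it contains. Storing only
    the nonzero columns is what makes the representation sparse.
    """
    vocab: Dict[str, int] = {}
    rows: List[List[int]] = []
    for text in texts:
        # A set discards repeated terms, so every entry is 0 or 1.
        terms = set(text.lower().split())
        for term in sorted(terms):
            # Assign the next free column to each previously unseen term.
            vocab.setdefault(term, len(vocab))
        rows.append(sorted(vocab[term] for term in terms))
    return vocab, rows


def density(rows: List[List[int]], n_columns: int) -> float:
    """Fraction of nonzero entries in the implied texts-by-terms matrix."""
    nonzeros = sum(len(row) for row in rows)
    return nonzeros / (len(rows) * n_columns)


# Hypothetical product names, invented for illustration only.
titles = [
    "stainless steel water bottle",
    "wireless bluetooth headphones",
    "cotton bath towel set",
    "steel kitchen knife set",
]

vocab, rows = binary_bag_of_words(titles)
print(len(vocab))                  # number of columns (unique terms)
print(rows)                        # nonzero column indices per text
print(density(rows, len(vocab)))   # share of nonzero entries
```

Even in this toy corpus the number of columns grows with every new term while each row keeps only three or four nonzero entries. On a realistic corpus with many thousands of unique terms, the same effect produces the high-dimensional, extremely sparse vectors that the abstract identifies as the main challenge.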

Keywords:

Clustering high-dimensional data
Theoretical computer science
Artificial intelligence
Product (category theory)
Small number
Linear algebra
Machine learning
Subspace topology
Sentiment analysis
Mathematics
Cluster analysis
Subspace clustering
Pattern recognition
