Approximation Schemes for Clustering with Outliers

Zachary Friggstad,Kamyar Khodamoradi,Mohsen Rezapour,Mohammad R. Salavatipour

Approximation Schemes for Clustering with Outliers

2019

Clustering problems are well studied in a variety of fields, such as data science, operations research, and computer science. Such problems include variants of center location problems, k-median and k-means to name a few. In some cases, not all data points need to be clustered; some may be discarded for various reasons. For instance, some points may arise from noise in a dataset or one might be willing to discard a certain fraction of the points to avoid incurring unnecessary overhead in the cost of a clustering solution. We study clustering problems with outliers. More specifically, we look at uncapacitated facility location (UFL), k-median, and k-means. In these problems, we are given a set X of data points in a metric space δ(., .), a set C of possible centers (each maybe with an opening cost), maybe an integer parameter k, plus an additional parameter z as the number of outliers. In uncapacitated facility location with outliers, we have to open some centers, discard up to z points of X, and assign every other point to the nearest open center, minimizing the total assignment cost plus center opening costs. In k-median and k-means, we have to open up to k centers, but there are no opening costs. In k-means, the cost of assigning j to i is δ2(j, i). We present several results. Our main focus is on cases where δ is a doubling metric (this includes fixed dimensional Euclidean metrics as a special case) or is the shortest path metrics of graphs from a minor-closed family of graphs. For uniform-cost UFL with outliers on such metrics, we show that a multiswap simple local search heuristic yields a PTAS. With a bit more work, we extend this to bicriteria approximations for the k-median and k-means problems in the same metrics where, for any constant e > 0, we can find a solution using (1 + e)k centers whose cost is at most a (1 + e)-factor of the optimum and uses at most z outliers. Our algorithms are all based on natural multiswap local search heuristics. We also show that natural local search heuristics that do not violate the number of clusters and outliers for k-median (or k-means) will have unbounded gap even in Euclidean metrics. Furthermore, we show how our analysis can be extended to general metrics for k-means with outliers to obtain a (25 + e, 1 + e)-approximation: an algorithm that uses at most (1 + e)k clusters and whose cost is at most 25 + e of optimum and uses no more than z outliers.

Keywords:

Correction
Cite
Save
Machine Reading By IdeaReader

References

Citations