Experiments with Maximin Sampling

Omar A. Ibrahim, James M. Keller, James C. Bezdek, Mihail Popescu

2020

To apply clustering algorithms to big data, or to build clustering ensembles, it is standard practice to sample the original data set in a way that, ideally, spans the original distribution. There are at least six ways to initialize the Maximin (MM) sampling algorithm. This paper reports experiments to determine whether samples produced by the six methods differ significantly, and whether they are superior to simple random sampling. Empirical evidence supports two conclusions. First, there is not enough difference in MM samples generated by the six initializations to support using any but the least costly method, viz., using the first sample in the data as the first MM point. Second, unless the input data have subsets (clusters) that are compact and separated in a well-defined sense, random sampling is demonstrably superior to MM sampling, even for small data sets.
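
The abstract does not spell out the MM procedure itself. The following is a minimal sketch, assuming the common greedy farthest-first formulation with Euclidean distance: start from one point, then repeatedly add the point whose distance to its nearest already-chosen point is largest. The function names `maximin_sample` and `random_sample`, the `first_index` parameter, and the toy data are illustrative assumptions, not taken from the paper. The default `first_index=0` corresponds to the least costly initialization named above (the first sample in the data); the other initializations are not shown.

```python
import math
import random


def maximin_sample(points, k, first_index=0):
    """Return indices of k points picked by greedy maximin selection.

    Assumption: Euclidean distance, and the first MM point is
    points[first_index] (index 0 = "first sample in the data").
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [first_index]
    # min_dist[i]: distance from point i to its nearest chosen point so far.
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(chosen) < k:
        # Next MM point: the one farthest from the current sample.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Update nearest-chosen distances against the new point only.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


def random_sample(points, k, seed=0):
    """Simple random sampling baseline: k distinct indices, no replacement."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), k)


# Hypothetical toy data, only to show the call pattern.
points = [(0, 0), (1, 0), (10, 0), (0, 8), (5, 5)]
mm_indices = maximin_sample(points, 3)
rs_indices = random_sample(points, 3)
```

In this sketch each new MM point costs one pass over the data, so drawing k samples from n points takes on the order of n times k distance computations, whereas the random baseline needs no distance computations at all.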

Keywords:

Mathematics
Minimax
Sampling (statistics)
Statistics
