Experiments with Maximin Sampling
2020
To apply clustering algorithms to big data, or to build clustering ensembles, it is standard practice to sample the original data set in a way that hopefully spans the original distribution. There are at least six ways to initialize the Maximin (MM) sampling algorithm. This paper contains experiments to determine whether samples produced by the six methods differ significantly, and whether they are superior to simple random sampling. Empirical evidence supports two conclusions. First, there is not enough difference in MM samples generated by the six initializations to support using any but the least costly method: viz., using the first point in the data as the first MM point. Second, unless the input data have subsets (clusters) that are compact and separated in a well-defined sense, random sampling is demonstrably superior to MM sampling even for small data sets.
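A minimal sketch of Maximin (farthest-first) sampling may help fix ideas, using the least costly initialization mentioned above: taking the first record in the data set as the first MM point. Function and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def maximin_sample(X, n_samples):
    """Select n_samples rows of X by Maximin (farthest-first) traversal,
    initialized with the first record in the data."""
    X = np.asarray(X, dtype=float)
    selected = [0]  # initialization: first point in the data set
    # distance from every point to its nearest selected point so far
    min_dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(1, n_samples):
        nxt = int(np.argmax(min_dist))  # point farthest from the current sample
        selected.append(nxt)
        # update nearest-selected-point distances
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[selected], selected

# Example usage on synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))
    sample, idx = maximin_sample(data, 25)
    print(sample.shape)  # (25, 2)
```

Because each new point is chosen to maximize its minimum distance to the points already selected, MM sampling tends to favor outliers and boundary points unless the clusters are compact and well separated, which is consistent with the paper's second conclusion.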