Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data
2011
Given the pervasiveness of time series data in all human endeavors, and the ubiquity of clustering as a data mining application, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series, e.g., gene expression profiles, individual heartbeats or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the Minimum Description Length (MDL) framework offers an efficient, effective and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, zoology and industrial process analyses.
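The MDL idea the abstract refers to can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's algorithm: the function names, the 6-bit discretization, and the entropy-based cost (length times empirical entropy) are all hypothetical choices. The sketch only shows the general principle that a subsequence is cheaper to describe as a difference from a similar pattern than on its own.

```python
import math
from collections import Counter

def discretize(series, bits=6):
    """Min-max normalize and round to integers in [0, 2**bits - 1] (assumed scheme)."""
    lo, hi = min(series), max(series)
    levels = 2 ** bits - 1
    if hi == lo:
        return [0] * len(series)
    return [round((x - lo) / (hi - lo) * levels) for x in series]

def description_length(symbols):
    """Approximate bits needed for the symbols: length times empirical entropy."""
    n = len(symbols)
    counts = Counter(symbols)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return n * entropy

def conditional_length(symbols, hypothesis):
    """Bits needed to encode `symbols` as its pointwise difference from `hypothesis`."""
    diff = [a - b for a, b in zip(symbols, hypothesis)]
    return description_length(diff)

# Hypothetical data: a two-period sine pattern and a near-copy with two small deviations.
pattern = discretize([math.sin(2 * math.pi * i / 32) for i in range(64)])
similar = list(pattern)
similar[10] += 1
similar[40] -= 1

raw_bits = description_length(similar)
saved_bits = raw_bits - conditional_length(similar, pattern)
print(f"raw: {raw_bits:.1f} bits, saved by encoding against the pattern: {saved_bits:.1f} bits")
```

Under this kind of scoring, grouping a subsequence with a pattern is worthwhile only when it saves bits, which gives a natural reason to leave dissimilar data unclustered without a user-set threshold.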
Keywords:
- Machine learning
- Data stream clustering
- Data mining
- Artificial intelligence
- Cluster analysis
- Correlation clustering
- Computer science
- k-medians clustering
- Canopy clustering algorithm
- FLAME clustering
- CURE data clustering algorithm
- Brown clustering
- Fuzzy clustering
- Constrained clustering
- Clustering high-dimensional data