Novel change detection techniques in multidimensional data mining

2008 
Data mining is the process of automatically discovering useful information in large data repositories. Its development is motivated by the challenges posed by modern data sets, such as large size, high dimensionality, and heterogeneity. This thesis proposes several novel data mining methods for change detection. The first problem considered is detecting anomalies in a given data set. Anomalies are data points that differ from the rest of the data set. The thesis proposes a method that makes use of domain knowledge provided by the user. Often, the data include a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored, because they directly affect the expected distribution of the result attributes, whose values can indicate an anomalous observation. The proposed method takes this difference between the two kinds of attributes into account. The second problem considered is detecting a change of distribution in multi-dimensional data sets. Given a baseline data set and a set of newly observed data points, the proposed method defines a statistical test for deciding whether the observed data points are sampled from the underlying distribution that produced the baseline data set. The method defines a test statistic that is strictly distribution-free under the null hypothesis. The experimental results show that the proposed test has substantially more power than existing methods for multi-dimensional change detection. The third problem considered is modeling the temporal change in the prominence of data clusters. Existing work develops a mixture model that treats the time information as one of the random variables, which makes the model sensitive to the distribution of time. The proposed method defines a Bayesian mixture model whose mixing proportions are given by a set of linear regressions conditioned on time. A Gibbs sampler is used to derive the distributions of the random variables in the model.
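
To make the second problem concrete, the following is a minimal sketch of a two-sample test for multi-dimensional data. It is not the test statistic defined in the thesis, which the abstract does not specify. It swaps in a generic permutation test, chosen only because its p-value is also valid under the null hypothesis without assumptions about the underlying distribution. The function names, the nearest-neighbor statistic, the sample sizes, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import numpy as np


def mean_nn_distance(baseline, observed):
    """Mean Euclidean distance from each observed point to its nearest baseline point."""
    diffs = observed[:, None, :] - baseline[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1).mean()


def permutation_change_test(baseline, observed, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([baseline, observed])
    n_base = len(baseline)
    stat = mean_nn_distance(baseline, observed)
    exceed = 0
    for _ in range(n_perm):
        # Under H0 the pooled points are exchangeable, so any re-split is equally likely.
        idx = rng.permutation(len(pooled))
        perm_base, perm_obs = pooled[idx[:n_base]], pooled[idx[n_base:]]
        if mean_nn_distance(perm_base, perm_obs) >= stat:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)


# Hypothetical data: a 3-dimensional baseline and two candidate batches.
rng = np.random.default_rng(42)
baseline = rng.normal(size=(200, 3))
same = rng.normal(size=(50, 3))                # drawn from the baseline distribution
shifted = rng.normal(loc=3.0, size=(50, 3))    # mean shifted in every dimension

p_same = permutation_change_test(baseline, same)
p_shifted = permutation_change_test(baseline, shifted)
print(p_same, p_shifted)
```

A small p-value for a batch is evidence that it was not sampled from the distribution that produced the baseline. The permutation approach trades computation for generality, and its power depends heavily on the chosen statistic, which is the part the thesis addresses with its own construction.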