Randomization of real-valued matrices for assessing the significance of data mining results

2008 
Randomization is an important technique for assessing the significance of data mining results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that share row and column means and variances. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column means and variances, or due to some more interesting phenomena in the data. In this paper, we study the problem of generating such randomized datasets. We describe three alternative algorithms based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods on both real and generated data. The results indicate that the methods work efficiently and solve the defined problem.
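The general idea can be illustrated with a minimal sketch. It is not one of the three algorithms of the paper, whose details the abstract does not give. It assumes a single local transformation (swapping two matrix entries), an error measure equal to the summed absolute deviation of row and column means and variances from those of the original matrix, and a Metropolis acceptance rule with a hypothetical `weight` parameter. All function names are illustrative, and the `+1` correction in the empirical p-value is a common convention rather than something stated in the source.

```python
import math
import random


def row_col_stats(matrix):
    """Return the mean and variance of every row and column as one flat list."""
    vectors = [list(r) for r in matrix] + [list(c) for c in zip(*matrix)]
    stats = []
    for vec in vectors:
        mean = sum(vec) / len(vec)
        var = sum((x - mean) ** 2 for x in vec) / len(vec)
        stats.extend([mean, var])
    return stats


def stats_error(matrix, target):
    """Summed absolute deviation of the matrix statistics from the target."""
    return sum(abs(a - b) for a, b in zip(row_col_stats(matrix), target))


def metropolis_swap_sample(original, steps=500, weight=50.0, rng=None):
    """Draw one randomized matrix by Metropolis sampling over entry swaps.

    Assumption: the target distribution is proportional to
    exp(-weight * error), so matrices whose row/column means and
    variances stay close to the original are strongly preferred.
    """
    rng = rng or random.Random()
    target = row_col_stats(original)
    current = [list(r) for r in original]
    n_rows, n_cols = len(current), len(current[0])
    cur_err = stats_error(current, target)
    for _ in range(steps):
        # Local transformation: swap two uniformly chosen entries.
        i1, j1 = rng.randrange(n_rows), rng.randrange(n_cols)
        i2, j2 = rng.randrange(n_rows), rng.randrange(n_cols)
        current[i1][j1], current[i2][j2] = current[i2][j2], current[i1][j1]
        new_err = stats_error(current, target)
        # The proposal is symmetric, so the plain Metropolis rule applies.
        if new_err <= cur_err or rng.random() < math.exp(-weight * (new_err - cur_err)):
            cur_err = new_err
        else:
            # Rejected: undo the swap.
            current[i1][j1], current[i2][j2] = current[i2][j2], current[i1][j1]
    return current


def empirical_p_value(observed, sampled):
    """One-sided p-value: share of samples at least as large as observed."""
    count = sum(1 for s in sampled if s >= observed)
    return (count + 1) / (len(sampled) + 1)


def max_row_range(matrix):
    """Hypothetical measure of interest: largest within-row value range."""
    return max(max(r) - min(r) for r in matrix)


data = [
    [0.1, 0.9, 0.2, 0.8],
    [0.4, 0.5, 0.6, 0.5],
    [0.9, 0.1, 0.8, 0.2],
    [0.5, 0.5, 0.4, 0.6],
]
rng = random.Random(0)
samples = [metropolis_swap_sample(data, rng=rng) for _ in range(20)]
p = empirical_p_value(max_row_range(data), [max_row_range(s) for s in samples])
print(f"empirical p-value: {p:.3f}")
```

Recomputing all statistics after every swap keeps the sketch short. A practical implementation would update only the two affected rows and columns, and would check convergence of the chain before drawing samples.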