Estimating expected error rates of random forest classifiers: A comparison of cross-validation and bootstrap

2015 
Statistical learning has recently seen an expansion of applications across science, finance and industry, and it plays an important role within statistics, data mining and artificial intelligence; it therefore intersects with engineering and other disciplines as well. It is used for both regression and classification problems. Solving these problems usually involves building (training) a model or classifier and validating its performance on a given task. In this paper we compare two resampling methods for the assessment of a random forest classifier: k-fold cross-validation and the bootstrap. We use these methods to estimate the generalization error and to create learning curves. Both methods yield similar results on our data. The most important requirement for good generalization error estimates with either method is that the data sample used (i.e. the training dataset) represents the unknown true distribution of the data. This requirement cannot always be met in practice, and the results of resampling methods must be interpreted with care when it is violated.
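The two resampling estimators contrasted in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the paper's method: it substitutes a toy 1-nearest-neighbour rule and synthetic one-dimensional data for the authors' random forest and dataset (all names, data, and parameter choices here are assumptions), and it uses the leave-one-out (out-of-bag) variant of the bootstrap estimator.

```python
import random

random.seed(0)

# Hypothetical two-class toy data: class 0 ~ N(0, 1), class 1 ~ N(2, 1).
X = [random.gauss(0, 1) for _ in range(60)] + [random.gauss(2, 1) for _ in range(60)]
y = [0] * 60 + [1] * 60

def predict_1nn(train_X, train_y, x):
    # 1-nearest-neighbour rule as a simple stand-in classifier.
    i = min(range(len(train_X)), key=lambda j: abs(train_X[j] - x))
    return train_y[i]

def kfold_error(X, y, k=10):
    # k-fold cross-validation: average test error over k held-out folds.
    idx = list(range(len(X)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for test in folds:
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        tx, ty = [X[i] for i in train], [y[i] for i in train]
        wrong = sum(predict_1nn(tx, ty, X[i]) != y[i] for i in test)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / k

def bootstrap_error(X, y, B=50):
    # Leave-one-out bootstrap: train on each bootstrap replicate and
    # test only on the samples that replicate did not draw (out-of-bag).
    n = len(X)
    err_sum, err_cnt = 0, 0
    for _ in range(B):
        boot = [random.randrange(n) for _ in range(n)]
        oob = set(range(n)) - set(boot)
        tx, ty = [X[i] for i in boot], [y[i] for i in boot]
        for i in oob:
            err_sum += predict_1nn(tx, ty, X[i]) != y[i]
            err_cnt += 1
    return err_sum / err_cnt

print("10-fold CV error:", round(kfold_error(X, y), 3))
print("bootstrap error: ", round(bootstrap_error(X, y), 3))
```

Both functions estimate the same quantity, the expected error on unseen data, which is why (as the abstract reports for the paper's data) the two estimates tend to agree when the sample is representative of the true distribution.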