Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
2010
Background
Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant to the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests, and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution, and choice of metric for quantifying classifier performance. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay-based techniques.
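The comparison described above can be sketched in code. The snippet below is a minimal, hypothetical illustration (not the authors' simulation code): it builds a high-dimensional dataset where features vastly outnumber subjects, with few informative features and a class imbalance, then cross-validates the four classifiers from the abstract. scikit-learn's NearestCentroid with a shrinkage threshold is used as a stand-in for Prediction Analysis for Microarrays (nearest shrunken centroids); all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC

# Simulated 'omics'-like data: n = 60 subjects, p = 1000 features,
# only 10 informative features (the "signal"), 70/30 class imbalance.
X, y = make_classification(n_samples=60, n_features=1000,
                           n_informative=10, n_redundant=0,
                           weights=[0.7, 0.3], random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    # NearestCentroid with shrinkage approximates PAM
    # (nearest shrunken centroids).
    "PAM": NearestCentroid(shrink_threshold=0.5),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="linear"),
}

# Balanced accuracy is one metric choice that is robust to the
# class imbalance varied in the study.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(clf, X, y, cv=cv,
                                scoring="balanced_accuracy").mean()
          for name, clf in classifiers.items()}
for name, s in sorted(scores.items()):
    print(f"{name}: {s:.2f}")
```

In practice the study varies the signal-to-noise ratio and imbalance systematically; here a single configuration is shown for brevity.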
Keywords:
- Clustering high-dimensional data
- Sample size determination
- Random forest
- Support vector machine
- Statistical classification
- k-nearest neighbors algorithm
- Animal studies
- Bioinformatics
- Curse of dimensionality
- Computer science
- High dimensionality
- Data mining
- Statistical power
References: 36
Citations: 52