A statistical nonparametric method for identifying consistently important features across samples
2020
Abstract: In many applications, a consistently high measurement across many samples can indicate particularly meaningful or useful information for quality control or biological interpretation. Identifying these strong features among many others can be challenging, especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given any real-valued data, CFD requires no parameters, makes no assumptions about the shape of the underlying sample distributions, and is robust to differences across these distributions. We show that, with high probability, CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor-quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify housekeeping genes as genes that are highly expressed across tissue types, and we compare them to housekeeping gene lists from previous methods. CFD gives consistent results between the Human Body Map and GTEx data sets and, as expected, identifies lists of genes enriched for basic cellular processes. The framework can be easily adapted to many data types and desired feature properties.

Availability: Code for CFD and scripts to reproduce the figures and analysis in this work are available at https://github.com/Kingsford-Group/cfd.

Supplementary information: Supplementary data are available at https://github.com/Kingsford-Group/cfd.