Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size

2018 
RNA-Seq is now a widespread technology that allows efficient genome-wide quantification of gene expression for, among other uses, differential expression (DE) analysis. After a brief review of the main issues, methods and tools related to the DE analysis of RNA-Seq data, the article focuses on the impact of both replicate number and library size on such analysis. The main drawback of existing studies on this subject is their lack of generality. We therefore conducted both an analysis of a two-condition experiment (with 8 replicates), in order to compare the results with previous benchmark studies, and a meta-analysis of 17 experiments with up to 18 biological conditions, 8 replicates and 100 million reads per sample. As a global trend, we conclude that the replicate number has a higher impact than the library size on the power of the DE analysis, except for lowly expressed genes, for which both parameters seem to have the same impact. Beyond global trends, our study brings out new insights for practitioners aiming to improve their experimental designs. For instance, by analyzing both the sensitivity and the specificity of the DE analysis, we show that the optimal threshold to control the FDR is approximately equal to 2^(-r), where r is the replicate number. Nonetheless, we show that the FPR is rather well controlled by all three studied R packages: DESeq, DESeq2 and edgeR. We also analyzed the impact of both the replicate number and the library size on the GO enrichment analysis. Interestingly, our study concludes that increasing the replicate number tends to enhance the sensitivity of the GO analysis, while increasing the library size tends to enhance its specificity. Finally, we recommend that RNA-Seq practitioners produce a pilot dataset to rigorously assess the power of their experimental design, or use a dataset from a public database that resembles the dataset they expect to obtain.
For the practitioners of the tomato community, on the basis of the meta-analysis, we recommend at least 4 biological replicates per condition and 20 million reads per sample to be almost sure to obtain about one thousand DE genes.
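
As a rough illustration of the threshold rule stated above (optimal FDR threshold approximately 2^(-r), with r the replicate number), the following minimal Python sketch tabulates the value for a few replicate numbers. The function name `approx_fdr_threshold` is hypothetical and introduced only for this example; the abstract gives the relation as approximate and does not specify how the threshold is applied in practice, so this shows the arithmetic only.

```python
def approx_fdr_threshold(replicates: int) -> float:
    """Return 2**(-r), the approximate FDR threshold quoted in the abstract.

    Hypothetical helper for illustration only; r is the number of
    biological replicates per condition.
    """
    if replicates < 1:
        raise ValueError("replicate number must be a positive integer")
    return 2.0 ** -replicates


if __name__ == "__main__":
    # Tabulate the rule for 2 to 8 replicates (8 is the maximum in the study).
    for r in range(2, 9):
        print(f"r = {r}: threshold ~ {approx_fdr_threshold(r):.4g}")
```

Under this rule, the recommended minimum of 4 replicates corresponds to a threshold of about 0.0625, and 8 replicates to about 0.0039.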