Comparison of sparse biclustering algorithms for gene expression datasets

2020 
Gene clustering and sample clustering are commonly used to find patterns in gene expression datasets. However, in heterogeneous samples (e.g. different tissues or disease states), genes may cluster differently. Biclustering algorithms aim to solve this issue by performing sample clustering and gene clustering simultaneously. Existing reviews of biclustering algorithms have yet to include a number of more recent algorithms, and have based their comparisons on simplistic simulated datasets and less robust metrics, without specific evaluation of biclusters in real datasets. In this study we compared four classes of sparse biclustering algorithms on a range of simulated and real datasets. In particular, we used a knockout mouse RNA-seq dataset to evaluate each algorithm's ability to simultaneously cluster genes and cluster samples across multiple tissues. We found that Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than other algorithm classes. We assessed whether non-negative matrix factorisation algorithms can be repurposed for biclustering and found that, although the raw output was poor, after applying a sparsity-inducing post-processing procedure that we introduce, one such algorithm was among the most highly ranked on real datasets. We also demonstrate the limitations of biclustering algorithms by varying the complexity of simulated datasets. The algorithms generally struggled on simulated datasets with a large number of implanted factors, or with a large number of genes. In real datasets, the algorithms rarely returned clusters containing samples from multiple tissues, which highlights the need for further thought in the design and analysis of multi-tissue studies to avoid differences between tissues dominating the analysis.
Code to run the analysis is available at https://github.com/nichollskc/biclust_comp, including wrappers for each algorithm, implementations of evaluation metrics, and code to simulate datasets and perform pre- and post-processing. The full tables of results are available at https://doi.org/10.5281/zenodo.4317556
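The abstract does not spell out the sparsity-inducing post-processing procedure, so the following is only an illustrative sketch of the general idea of turning dense non-negative factors into biclusters. It assumes a factorisation of a samples-by-genes matrix into `W` (samples by factors) and `H` (factors by genes); the function name `threshold_factors`, the relative `threshold` of 0.5, and the toy matrices are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def threshold_factors(W, H, threshold=0.5):
    """Convert dense non-negative factors into biclusters by thresholding.

    W: array of shape (n_samples, K), sample loadings per factor.
    H: array of shape (K, n_genes), gene loadings per factor.
    threshold: hypothetical cut-off, relative to each factor's maximum.

    Returns a list of (sample_indices, gene_indices) pairs, one per
    factor that has at least one positive sample and gene loading.
    """
    biclusters = []
    for k in range(W.shape[1]):
        w, h = W[:, k], H[k, :]
        w_max, h_max = w.max(), h.max()
        if w_max <= 0 or h_max <= 0:
            continue  # an all-zero factor defines no bicluster
        # Keep only entries that are large relative to the factor's maximum.
        samples = np.flatnonzero(w / w_max >= threshold)
        genes = np.flatnonzero(h / h_max >= threshold)
        biclusters.append((samples, genes))
    return biclusters


# Toy factors (made up for illustration): two factors, four samples, five genes.
W = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.05, 1.2],
              [0.0, 1.0]])
H = np.array([[2.0, 1.8, 0.1, 0.0, 0.2],
              [0.0, 0.3, 1.5, 1.4, 0.1]])

biclusters = threshold_factors(W, H)
for samples, genes in biclusters:
    print("samples", samples.tolist(), "genes", genes.tolist())
```

Scaling each factor by its own maximum makes the cut-off independent of the arbitrary scale of the factorisation; the procedure actually used in the study may differ and is available in the linked repository.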