Abstract Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Since single-cell data are susceptible to technical noise, the quality of the genes selected prior to clustering is crucial in the preliminary steps of downstream analysis. Therefore, robust gene selection has gained considerable attention in recent years. We introduce sc-REnF [robust entropy based feature (gene) selection method], which aims to leverage the advantages of Rényi and Tsallis entropies in gene selection for single-cell clustering. Experiments demonstrate that, with a tuned parameter (q), Rényi and Tsallis entropies select genes that improve the clustering results significantly over the competing methods. sc-REnF captures relevancy and redundancy among the features of noisy data extremely well due to its robust objective function. Moreover, the selected features/genes can determine the unknown cells with high accuracy. Finally, sc-REnF yields good clustering performance in small sample, large feature scRNA-seq data. Availability: sc-REnF is available at https://github.com/Snehalikalall/sc-REnF
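For reference, the two entropies named in this abstract have standard closed forms for a discrete distribution p and order q ≠ 1: the Rényi entropy is log(Σ p_i^q) / (1 − q) and the Tsallis entropy is (1 − Σ p_i^q) / (q − 1), and both approach the Shannon entropy as q → 1. The Python sketch below computes them from a probability vector. The function names are illustrative and are not taken from the sc-REnF code base; how sc-REnF estimates the distributions and combines the entropies into its objective is not shown here.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (natural log) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi_entropy(p, q):
    """Rényi entropy of order q; falls back to Shannon entropy at q == 1."""
    if q == 1:
        return shannon_entropy(p)
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)


def tsallis_entropy(p, q):
    """Tsallis entropy of order q; falls back to Shannon entropy at q == 1."""
    if q == 1:
        return shannon_entropy(p)
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)


# Hypothetical example: a uniform distribution over four expression bins.
uniform = [0.25, 0.25, 0.25, 0.25]
renyi_value = renyi_entropy(uniform, q=2)      # equals log(4) for any q
tsallis_value = tsallis_entropy(uniform, q=2)  # equals 1 - 1/4
```

The parameter q controls how strongly high-probability outcomes dominate the sum Σ p_i^q, which is why the abstract treats it as a tunable quantity.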
Abstract Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Various issues in single cell sequencing affect the homogeneous grouping (clustering) of cells, such as the small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell cycle, cellular morphology, and variable reagent concentrations. Moreover, single cell data is susceptible to technical noise, which affects the quality of the genes (or features) selected/extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single cell clustering), a stepwise robust unsupervised feature extraction and clustering approach that formulates and aggregates cell-cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor, and this graph is learned by a graph-based artificial intelligence model, a graph convolution network. The learned representation (low dimensional embedding) is utilized for cell clustering. sc-CGconv features the following advantages. a. sc-CGconv works with substantially smaller sample sizes to identify homogeneous clusters. b. sc-CGconv can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering. c. sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure. d. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space. The source code and usage information are available at https://github.com/Snehalikalall/CopulaGCN Contact: sumanta.ray@cwi.nl
Single cell RNA sequencing (scRNA-seq) is a powerful tool to capture gene expression snapshots in individual cells. However, the low amount of RNA in individual cells results in dropout events, which introduce a large number of zero counts in the single cell expression matrix. We have developed VAImpute, a variational graph autoencoder based imputation technique that learns the inherent distribution of a large network/graph constructed from the scRNA-seq data, leveraging copula correlation (Ccor) among cells/genes. The trained model is utilized to predict the dropout events by computing the probability of all non-edges (cell-gene) in the network. We devise an algorithm to impute the missing expression values of the detected dropouts. The performance of the proposed model is assessed on both simulated and real scRNA-seq datasets, comparing it to established single-cell imputation methods. VAImpute yields significant improvements in detecting dropouts, thereby achieving superior performance in cell clustering, rare cell detection, and differential expression analysis. All code and datasets are available at https://github.com/sumantaray/VAImpute
Abstract Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Since single cell data is susceptible to technical noise, the quality of the genes selected prior to clustering is crucial in the preliminary steps of downstream analysis. Therefore, robust gene selection has gained considerable attention in recent years. We introduce sc-REnF (robust entropy based feature (gene) selection method), which aims to leverage the advantages of Rényi and Tsallis entropies in gene selection for single cell clustering. Experiments demonstrate that, with a tuned parameter (q), Rényi and Tsallis entropies select genes that improve the clustering results significantly over the competing methods. sc-REnF captures relevancy and redundancy among the features of noisy data extremely well due to its robust objective function. Moreover, the selected features/genes can cluster the unknown cells with high accuracy. Finally, sc-REnF yields good clustering performance in small sample, large feature scRNA-seq data.
In this paper, we develop a novel feature selection method called RCFS (Regularized Copula based Feature Selection) based on a regularized copula. We use l1 regularization, as it penalizes the coefficients of redundant features and drives them to zero, resulting in a non-redundant, effective feature set. The scale-invariant property of the copula ensures good performance on noisy data, thereby improving the stability of the method. Three different forms of copula, viz. the Gaussian copula, the empirical copula, and the Archimedean copula, are used with l1 regularization. Results show a significant improvement in the accuracy of the prediction model over the non-regularized feature selection methods compared. The number of features needed to achieve a fixed accuracy value is also smaller than for the other non-regularized feature selection techniques.
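To illustrate why an l1 penalty produces exact zeros, the sketch below applies soft-thresholding, the proximal operator of the l1 norm, to a hypothetical coefficient vector. This is a generic illustration of the sparsity mechanism under the assumption of a simple proximal step; it is not the RCFS objective or its optimizer, and the coefficient values and the penalty are made up for the example.

```python
import math


def soft_threshold(coefficients, penalty):
    """Element-wise proximal operator of penalty * ||w||_1.

    Each coefficient is shrunk toward zero by `penalty`; coefficients whose
    magnitude does not exceed `penalty` become exactly zero.
    """
    shrunk = []
    for w in coefficients:
        magnitude = max(abs(w) - penalty, 0.0)
        shrunk.append(math.copysign(magnitude, w) if magnitude > 0 else 0.0)
    return shrunk


# Hypothetical feature coefficients before regularization.
weights = [3.0, -0.5, 1.2, 0.05]
sparse_weights = soft_threshold(weights, penalty=1.0)

# Features whose coefficient survives the shrinkage are kept.
selected = [i for i, w in enumerate(sparse_weights) if w != 0.0]
```

Here the second and fourth coefficients fall below the penalty and are zeroed, so only features 0 and 2 remain selected.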
Abstract High dimensional, small sample size (HDSS) scRNA-seq data presents a challenge to the gene selection task in single cell analysis. Conventional gene selection techniques are unstable and less reliable due to the small number of available samples, which affects cell clustering and annotation. Here, we present an improved version of the generative adversarial network (GAN), called LSH-GAN, to address this issue by producing new realistic samples and combining them with the original scRNA-seq data. We update the training procedure of the GAN generator using locality sensitive hashing, which speeds up the sample generation and thus keeps gene selection procedures feasible on high dimensional scRNA-seq data. Experimental results show a significant improvement in the performance of benchmark feature (gene) selection techniques on generated samples of one synthetic and four HDSS scRNA-seq datasets. A comprehensive simulation study supports the applicability of the model in the feature (gene) selection domain of HDSS scRNA-seq data. Availability: The corresponding software is available at https://github.com/Snehalikalall/LSH-GAN
Gene selection in large unannotated single cell RNA sequencing (scRNA-seq) data is a crucial preliminary step of downstream analysis. The existing approaches, which are primarily based on high variation (highly variable genes) or significantly high expression (highly expressed genes), fail to provide a stable and predictive feature set due to the technical noise present in the data. Here, we propose RgCop, a novel regularized copula based method for gene selection from large single cell RNA-seq data. RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes in single cell expression data. We formulate an objective function by adding an l1 regularization term to Ccor to penalize the coefficients of redundant features/genes, resulting in a non-redundant, effective feature/gene set. Results show a significant improvement in the clustering/classification performance on real-life scRNA-seq data over other state-of-the-art methods. RgCop performs extremely well in capturing dependence among the features of noisy data due to the scale-invariant property of the copula, thereby improving the stability of the method. Moreover, the differentially expressed (DE) genes identified from the clusters of scRNA-seq data are found to provide an accurate annotation of cells. Finally, the features/genes obtained from RgCop are able to annotate unknown cells with high accuracy.
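The scale invariance mentioned above comes from the fact that copula-based measures depend on the data only through ranks. The sketch below, a minimal illustration and not the Ccor computation itself, maps a hypothetical expression vector to rank-based pseudo-observations and checks that a strictly increasing transform (a log transform or a rescaling) leaves them unchanged. The function name and the example values are assumptions made for the illustration, and ties are assumed absent.

```python
import math


def pseudo_observations(values):
    """Map values to (0, 1) through their ranks: u_i = rank_i / (n + 1).

    Assumes all values are distinct (no tie handling).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, index in enumerate(order, start=1):
        u[index] = rank / (n + 1)
    return u


# Hypothetical expression values of one gene across five cells.
expression = [5.0, 120.0, 0.5, 33.0, 8.0]

# Two strictly increasing transforms of the same data.
log_expression = [math.log1p(x) for x in expression]
rescaled_expression = [10.0 * x for x in expression]

u_raw = pseudo_observations(expression)
u_log = pseudo_observations(log_expression)
u_rescaled = pseudo_observations(rescaled_expression)
# All three lists are identical because the ranks are preserved.
```

Any dependence measure computed from these pseudo-observations is therefore unaffected by monotone changes of scale in the individual genes.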
With the emergence of droplet-based technologies, it has now become possible to profile transcriptomes of several thousands of cells in a day. Although such a large single-cell cohort may favor the discovery of cellular heterogeneity, it also brings new challenges in the prediction of minority cell types. Identification of any minority cell type holds a special significance in knowledge discovery. In the analysis of single-cell expression data, the use of principal component analysis (PCA) is surprisingly frequent for dimension reduction. The principal directions obtained from PCA are usually dominated by the major cell types in the concerned tissue. Thus, it is very likely that using a traditional PCA may endanger the discovery of minority populations. To this end, we propose locality-sensitive PCA (LSPCA), a scalable variant of PCA equipped with structure-aware data sampling at its core. Structure-aware sampling provides PCA with a neutral spread of the data, thereby reducing the bias in its principal directions arising from the redundant samples in a data set. We benchmarked the performance of the proposed method on ten publicly available single-cell expression data sets including one very large annotated data set. Results have been compared with traditional PCA and PCA with random sampling. Clustering results on the annotated data sets also show that LSPCA can detect the minority populations with a higher accuracy.
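The abstract above does not spell out how the structure-aware sampling is performed. As one possible reading, the sketch below hashes cells with random-hyperplane locality-sensitive hashing and keeps a single cell per hash bucket, so that a dense majority population contributes no more samples than a sparse one occupying the same number of buckets. The hashing scheme, the function names, the parameters `n_planes` and `seed`, and the toy data are all assumptions made for illustration and are not taken from the LSPCA implementation; the retained subset would then be passed to an ordinary PCA.

```python
import random


def hyperplane_signature(point, hyperplanes):
    """Bit signature of a point: which side of each hyperplane it lies on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, point)) >= 0 else 0
        for plane in hyperplanes
    )


def structure_aware_sample(points, n_planes=4, seed=0):
    """Keep one randomly chosen point per LSH bucket (illustrative only)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random hyperplanes through the origin with Gaussian normal vectors.
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for index, point in enumerate(points):
        signature = hyperplane_signature(point, hyperplanes)
        buckets.setdefault(signature, []).append(index)
    # One representative index per bucket, returned in sorted order.
    return sorted(rng.choice(members) for members in buckets.values())


# Toy data: 95 cells of a major type and 5 cells of a minor type.
data_rng = random.Random(1)
major = [[5 + data_rng.gauss(0, 0.1), 5 + data_rng.gauss(0, 0.1)] for _ in range(95)]
minor = [[-5 + data_rng.gauss(0, 0.1), 5 + data_rng.gauss(0, 0.1)] for _ in range(5)]
cells = major + minor

kept = structure_aware_sample(cells)  # at most 2 ** n_planes indices
```

Because each bucket yields one representative regardless of how many cells fall into it, redundant samples from the dominant population are thinned out before the principal directions are estimated.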
Abstract A fundamental problem in the downstream analysis of scRNA-seq data is the lack of enough cell samples compared to the feature size. This is mostly due to the budgetary constraints of single cell experiments or simply because of the small number of available patient samples. Here, we present an improved version of the generative adversarial network (GAN), called LSH-GAN, to address this issue by producing new realistic cell samples. We update the training procedure of the GAN generator using locality sensitive hashing, which speeds up the sample generation and thus keeps the standard procedures of downstream analysis feasible. LSH-GAN outperforms the benchmarks for realistic generation of quality cell samples. Experimental results show that the samples generated by LSH-GAN improve the performance of downstream analyses such as feature (gene) selection and cell clustering.