Additional file 3 Visualization of the imputation process. a, c Heatmap of SF and OV lab test data before imputation. b, d Heatmap of SF and OV lab test data after imputation. Black tiles refer to missing entries. Abbreviations: NK, natural killer cells. Th, T-helper lymphocyte. Ts, T-suppressor lymphocyte. CRP, C reactive protein. PCT, procalcitonin. IFN-γ, interferon-γ. TNF-α, tumor necrosis factor α. IL-1β, interleukin 1β. IL-2R, interleukin 2 receptor. IL-4, interleukin 4. IL-6, interleukin 6. IL-8, interleukin 8. IL-10, interleukin 10. C-IGM, SARS-CoV-2 specific antibody IgM. C-IGG, SARS-CoV-2 specific antibody IgG. SF, Sino-French New City Campus of Tongji Hospital. OV, Optical Valley Campus of Tongji Hospital.
Abstract Background Single-cell RNA-sequencing (scRNA-seq) is becoming indispensable in the study of cell-specific transcriptomes. However, in scRNA-seq techniques, only a small fraction of the genes are captured due to “dropout” events. These dropout events require intensive treatment when analyzing scRNA-seq data. For example, imputation tools have been proposed to estimate dropout events and de-noise data. The performance of these imputation tools is often evaluated, or fine-tuned, using various clustering criteria based on ground-truth cell subgroup labels. This limits their effectiveness in cases where we lack cell subgroup knowledge. We consider an alternative strategy which requires the imputation to follow a “self-consistency” principle; that is, the imputation process refines its results until no internal inconsistency or dropouts remain in the data. Results We propose the use of “self-consistency” as the main criterion in performing imputation. To demonstrate this principle we devised I-Impute, a “self-consistent” method, to impute scRNA-seq data. I-Impute optimizes continuous similarities and dropout probabilities in iterative refinements until a self-consistent imputation is reached. On the in silico data sets, I-Impute consistently exhibited the highest Pearson correlations across different dropout rates compared with the state-of-the-art methods SAVER and scImpute. Furthermore, we collected three wet-lab datasets (a mouse bladder cells dataset, an embryonic stem cells dataset, and an aortic leukocyte cells dataset) to evaluate the tools. I-Impute exhibited feasible cell subpopulation discovery efficacy on all three datasets. It achieved the highest clustering accuracy compared with SAVER and scImpute. Conclusions A strategy based on “self-consistency”, captured through our method, I-Impute, gave imputation results better than the state-of-the-art tools. Source code of I-Impute can be accessed at https://github.com/xikanfeng2/I-Impute.
Dear editor, The coronavirus disease 2019 (COVID-19) is characterized by heterogeneous clinical features and multiple organ damage. Many patients with mild symptoms can suddenly develop critical illness and progress to a refractory state with significantly increased mortality, indicating the necessity to promptly identify patients at high risk of physiologic deterioration before the occurrence of critical COVID-19. With contributions from both innate and adaptive immune compartments, the cytokine storm in COVID-19 is of wide concern. Hyperinflammatory response induced by immune dysfunction is reported to underpin critical COVID-19.1 Uncontrolled release of cytokines results in tissue damage and further leads to multiple organ failure, which is the major cause of death in patients with COVID-19.2 As expected, differences in multiple cytokines and immune features between critically ill and noncritically ill patients have been observed in clinical practice. Besides, early seroconversion and high antibody titer were linked with less severe clinical symptoms. Using inflammatory/immune factors to predict the risk of developing critical COVID-19 with the assistance of machine learning (ML) is a promising aid to management of the disease, but has rarely been reported. Electronic health records (EHRs) harbor valuable resources generated from routine medical activities and have been widely used. However, medical data are often complex, multidimensional, nonlinear, and heterogeneous, and require more effective statistical methods than traditional logistic regression. ML is a subfield of artificial intelligence that encapsulates statistical and mathematical algorithms, enabling interrogation of facts and complex decision making from a given set of data. The combination of EHRs and ML shows potential applications in predicting the risk of atherosclerotic cardiovascular disease and gestational diabetes.
In this multicenter study, we developed an online model with four inflammatory factors (C reactive protein [CRP], tumor necrosis factor α [TNF-α], interleukin 2 receptor [IL-2R], and interleukin 6 [IL-6]) that enabled accurate identification of COVID-19 patients prone to critical illness approximately 20 days in advance. The model was validated in an internal validation cohort (SFV cohort) and an external validation cohort (OV cohort). The study design is presented in Figure 2A. The detailed demographics and characteristics of patients are shown in Table S1. A total of 15 raw inflammatory/immune features were collected from COVID-19 patients at admission. After feature filtering (Figure S1) and data imputation (Figure S2),3 eight features were fitted into Least Absolute Shrinkage and Selection Operator (LASSO)4 logistic regression for feature selection (Figure 1A). As illustrated in Figure 1B, we considered features whose coefficients equaled zero to be redundant and less predictive. As a result, LASSO analysis identified four features (CRP, TNF-α, IL-2R, and IL-6) for the development of the critical illness classifier. We conducted Spearman correlation analysis between the four features and critical illness status. Figure S3A indicates that positive correlations of varying degrees existed across five features. The top weighted features, IL-6 (R = 0.49), CRP (R = 0.47), IL-2R (R = 0.43), and TNF-α (R = 0.37), were consistent with previously reported risk factors that were highly correlated with poor outcome of COVID-19. Standard box plots showed significant differences (P < 2.2e-16) in the four features between critically ill and noncritically ill COVID-19 patients (Figure S3B).
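The Spearman analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`average_ranks`, `pearson`, `spearman`) and the six example values are hypothetical, chosen only to show how a rank correlation between one continuous feature and a binary critical illness status is computed. Because the status takes only two values, tied observations receive the mean of their ranks.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature values and status labels (1 = critically ill),
# made up purely for illustration.
feature = [2.0, 5.0, 18.0, 58.0, 170.0, 640.0]
status = [0, 0, 0, 1, 1, 1]
rho = spearman(feature, status)
```

With real data, an established routine such as `scipy.stats.spearmanr` would normally be preferred, since it also reports a P value.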
The median (IQR) expression levels of TNF-α (17.0, 10.5-29.3, pg/mL), CRP (182.7, 103.2-258.6, mg/L), IL-2R (1447.0, 993.0-2327.5, U/mL), and IL-6 (169.4, 58.1-640.9, pg/mL) were significantly higher in critically ill patients compared with TNF-α (8.1, 6.1-10.5, pg/mL), CRP (16.1, 2.5-57.8, mg/L), IL-2R (520.0, 299.0-770.5, U/mL), and IL-6 (5.0, 2.1-18.4, pg/mL) in noncritically ill patients. During the model development stage, five models (support vector machine [SVM], logistic regression [LR], gradient boosted decision tree [GBDT], k-nearest neighbor [KNN], and neural network [NN]) were trained for risk prediction. In general, all five models showed varying but promising critical illness risk prediction performance in the internal and external validation cohorts. CIRPMC (critical illness risk prediction model for COVID-19), derived from SVM, achieved the highest predictive performance. The relative feature importance rank of SVM is shown in Figure S4. As a binary classifier, CIRPMC output a critical illness risk probability (P) ranging from 0 to 1 for each patient and stratified patients with P < .5 as low risk and all others as high risk. For the SFV cohort, CIRPMC achieved an AUC (area under the receiver operating characteristic curve) of 0.946 (95% CI 0.923-0.969) in identifying patients at high risk of developing critical illness, with an accuracy of 92.7% (95% CI 90.4%-94.6%). For the OV cohort, CIRPMC demonstrated an AUC of 0.969 (95% CI 0.945-0.992) and an accuracy of 96.6% (95% CI 95.1%-97.7%) (Figure 1C, D). The calibration curve of CIRPMC in the two validation cohorts is depicted in Figure S5. Intriguingly, CIRPMC also displayed the lowest Brier score, 0.057 for the SFV cohort and 0.028 for the OV cohort. All other metrics and the performance of other models are listed in Table 1. With critical illness as status and time from admission to critical illness or discharge as the endpoint, Kaplan-Meier analysis further confirmed the risk stratification ability of the model.
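The threshold rule and the validation metrics reported above (accuracy, AUC, and Brier score) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names and the six probability/label pairs are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition instead of any particular library routine.

```python
def stratify(prob, threshold=0.5):
    """Risk group for one patient: P < threshold is low risk, otherwise high risk."""
    return "low risk" if prob < threshold else "high risk"


def accuracy(probs, labels, threshold=0.5):
    """Fraction of patients whose thresholded prediction matches the label."""
    preds = [0 if p < threshold else 1 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def auc(probs, labels):
    """Probability that a random positive case scores above a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and true labels (1 = critical illness),
# made up purely for illustration.
probs = [0.10, 0.40, 0.35, 0.80, 0.60, 0.30]
labels = [0, 0, 1, 1, 1, 0]

groups = [stratify(p) for p in probs]
acc = accuracy(probs, labels)
bs = brier_score(probs, labels)
roc_auc = auc(probs, labels)
```

The pairwise AUC is quadratic in the number of patients, which is acceptable for a sketch; for cohorts of realistic size, a rank-based implementation such as scikit-learn's `roc_auc_score` would be the usual choice.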
CIRPMC robustly stratified high-risk patients and low-risk patients with P < .0001 in both internal and external validation cohorts. The univariate Cox analysis also demonstrated the positive correlation between the CIRPMC-predicted critical illness subgroup and the ground truth critical illness survival for the internal (HR: 22.52, 95% CI 14.69-34.53) and external (HR: 54.30, 95% CI 32.21-91.52) validation cohorts, respectively (Figure 1E, F). Additionally, we launched an online calculator based on CIRPMC that takes the feature values needed for risk prediction of COVID-19 patients (https://cirpmc.deepomics.org/). After clinicians fill in the online form with the corresponding features, CIRPMC returns a personalized probability and risk group of critical illness. An example of the online prediction system is illustrated in Figure 2B. In this study, CIRPMC was developed to identify COVID-19 patients at high risk of developing critical illness and achieved high predictive performance, with an AUC ranging from 0.946 to 0.969 across the internal and external validation cohorts. Accurate and rapid risk stratification is critical to keeping health systems agile and hopefully will optimize patient outcomes where "time is life."
Figure 2. Working flow of the study. A, Study design. B, Illustration of the online prediction model CIRPMC. Abbreviations: CIRPMC, critical illness risk prediction model for COVID-19; CRP, C reactive protein; IL-2R, interleukin 2 receptor; IL-6, interleukin 6; OV cohort, external validation cohort of Optical Valley Campus of Tongji Hospital; SFT cohort, training cohort of Sino-French New City Campus of Tongji Hospital; SFV cohort, internal validation cohort of Sino-French New City Campus of Tongji Hospital; TNF-α, tumor necrosis factor α.
A certain degree of interpretability is a strength of CIRPMC.
In accord with previous reports, we found that the expression levels of four contributive inflammatory cytokines (CRP, TNF-α, IL-2R, and IL-6) were significantly higher in critically ill patients than in noncritically ill patients.5 Other strengths of CIRPMC are its stability and generalizability. The four features used for prediction are readily accessible and frequently monitored in routine clinical practice. Besides, they are relatively objective, solid, and less susceptible to human memory bias, suggesting that CIRPMC is not susceptible to human interference and generalizes well enough to be extended to other medical institutions. During the pandemic, many studies on prognosis prediction of COVID-19 have emerged.6-8 However, the sample size of most studies is small, thus harboring risks of overfitting.6, 7 Moreover, most studies lack independent external validation or the number of patients within external validation is limited,9, 10 which can impair the reproducibility and credibility of models. Our study has a larger sample size, independent external validation, detailed patient description, and a relatively long observation time (18-20 days). However, the study has some limitations. First, the patients included are primarily locals in Wuhan. Data from multiple provinces or countries could further improve the applicability and robustness of models. Besides, the prognostic implication of CIRPMC has not been evaluated in prospective cohorts due to the retrospective nature of this study. In conclusion, this retrospective, multicenter study showed that CIRPMC, which uses readily available features, holds great potential for accurate and timely (approximately 20 days in advance) identification of COVID-19 patients prone to developing critical illness. The model showed strong stability, generalizability, universality, and a wide prediction horizon, and can be easily extended to areas with limited medical resources.
The proposed model can potentially help clinicians identify patients who should be prioritized for early intervention and intensive monitoring, and eliminate delays to maximize the number of survivors during the rapidly developing global emergency. Equipped with high predictive performance, the online calculator CIRPMC deserves to be taken forward. However, these findings warrant further validation in prospective clinical trials. We are grateful to all health-care workers and people nationwide and worldwide who are involved in the fight against COVID-19. The opinions expressed reflect the collective views of the coauthors. The study was supported by the National Science and Technology Major Sub-Project (2018ZX10301402-002), the Technical Innovation Special Project of Hubei Province (2018ACA138), the National Natural Science Foundation of China (81572570, 81974405, 81772787, 81873452, 81702572, 81702574, and 82072889), Hubei Natural Science Foundation (2019CFB453), and the Fundamental Research Funds for the Central Universities (2017JYCXJJ025, 2018JYCXJJ001, and 2019kfyXMBZ024). The authors have no conflicts of interest to declare. This study was approved by the Research Ethics Commission of Tongji Hospital of Huazhong University of Science and Technology (TJ-IRB20200406) in view of the retrospective nature of the study, and all the procedures performed were part of routine care. The trial has been registered in the Chinese Clinical Trial Registry (ChiCTR2000032161). Informed consent was waived by the Ethics Commission of Tongji Hospital of Huazhong University of Science and Technology. QG had full access to all data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. YG designed the study. LC did the analysis. YG, LC, and HL interpreted the data and wrote the paper.
SZ, XF, YW, TJ, YY, JC, XJ, DL, XF, SW, RY, YY, SX, XX, PC, QM, XJ, and YW provided patients' samples and clinical data, entered the data into the database, and double-checked the data. QG, SL, CL, and DM advised on the conception and design of the study. All authors vouched for the respective data and analysis, approved the final version, and agreed to publish the manuscript. The data contain information that could compromise research participant privacy, and so are not publicly available. Data supporting the findings of this study are available from the corresponding author upon reasonable request. Supplement Methods. Figure S1: Visualization of the denoising and filtering process. Figure S2: Visualization of the imputation process. Figure S3: Statistical analysis of four features selected by LASSO. Figure S4: Relative feature importance of SVM model. Figure S5: Calibration curves of SVM model in cohorts. Table S1: Baseline characteristics of individuals by cohorts.
Abstract Temperate phages (active prophages induced from bacteria) help control pathogenicity, modulate community structure, and maintain gut homeostasis. Complete phage genome sequences are indispensable for understanding phage biology. Traditional plaque techniques are inapplicable to temperate phages due to their lysogenicity, curbing their identification and characterization. Existing bioinformatics tools for prophage prediction usually fail to detect accurate and complete temperate phage genomes. This study proposes a novel computational temperate phage detection method (TemPhD) that mines both the integrated active prophages and their spontaneously induced forms (temperate phages) from next-generation sequencing raw data. Applying the method to the available dataset resulted in 192 326 complete temperate phage genomes with different host species, expanding the existing number of complete temperate phage genomes by more than 100-fold. The wet-lab experiments demonstrated that TemPhD can accurately determine the complete genome sequences of the temperate phages, with exact flanking sites, outperforming other state-of-the-art prophage prediction methods. Our analysis indicates that temperate phages are likely to function in microbial evolution by (i) cross-infecting different bacterial host species; (ii) transferring antibiotic resistance and virulence genes; and (iii) interacting with hosts through restriction-modification and CRISPR/anti-CRISPR systems. This work provides a comprehensive database of complete temperate phage genomes and relevant information, which can serve as a valuable resource for phage research.
Recently, the prevalence and importance of RNA editing have been illuminated in mammals. However, studies on RNA editing in pigs, a widely used biomedical model animal, are rare. Here we collected RNA sequencing data across 11 tissues and identified more than 490,000 RNA editing sites. We annotated their biological features, detected flanking sequence characteristics of A-to-I editing sites and the impact of A-to-I editing on miRNA–mRNA interactions, and identified RNA editing quantitative trait loci (edQTL). Sus scrofa RNA editing sites showed high enrichment in repetitive regions, with a median editing level of 15.38%. As expected, 96.3% of the editing sites were located in non-coding regions, including introns, 3′ UTRs, intergenic regions, and gene proximal regions. There were 2233 editing sites located in the coding regions, and 980 of them caused missense mutations. Our results indicated that, for an A-to-I editing site, the four adjacent nucleotides, two before it and two after it, have a strong impact on the occurrence of editing. A commonly observed editing motif is CCAGG. We found that 4552 A-to-I RNA editing sites could disturb the original binding efficiencies of miRNAs and 4176 A-to-I RNA editing sites created new potential miRNA target sites. In addition, we performed edQTL analysis and found 1134 edQTLs that significantly affected the editing levels of 137 RNA editing sites. Finally, we constructed PRESDB, the first pig RNA editing sites database. The site provides the functions necessary for Sus scrofa RNA editing studies.
ABSTRACT Advances in single-cell DNA sequencing (scDNA-seq) enable us to characterize the genetic heterogeneity of cancer cells. However, the high noise and low coverage of scDNA-seq impede the estimation of copy number variations (CNVs). In addition, existing tools suffer from long execution times and often fail on large datasets. Here, we propose SeCNV, a novel method that leverages structural entropy to profile the copy numbers. SeCNV adopts a local Gaussian kernel to construct a matrix, the depth congruent map, capturing the similarities between any two bins along the genome. Then SeCNV partitions the genome into segments by minimizing the structural entropy from the depth congruent map. With the partition, SeCNV estimates the copy numbers within each segment for cells. We simulate nine datasets with various breakpoint distributions and amplitudes of noise to benchmark SeCNV. SeCNV achieves robust performance, i.e., the F1-scores are higher than 0.95 for breakpoint detection, significantly outperforming state-of-the-art methods. SeCNV successfully processes large datasets (>50,000 cells) within four minutes, while other tools fail to finish within the time limit of 120 hours. We apply SeCNV to single-nucleus sequencing (SNS) datasets from two breast cancer patients and acoustic cell tagmentation (ACT) sequencing datasets from eight breast cancer patients. SeCNV successfully reproduces the distinct subclones and infers tumor heterogeneity. SeCNV is available at https://github.com/deepomicslab/SeCNV.
Accurately identifying gene regulatory networks is an important task in understanding in vivo biological activities. The inference of such networks is often accomplished through the use of gene expression data. Many methods have been developed to evaluate gene expression dependencies between a transcription factor and its target genes, and some methods also eliminate transitive interactions. The regulatory (or edge) direction is undetermined if the target gene is also a transcription factor. Some methods predict the regulatory directions in gene regulatory networks by locating the eQTL single nucleotide polymorphism, or by observing the gene expression changes when knocking out/down the candidate transcription factors; regrettably, these additional data are usually unavailable, especially for samples derived from human tissues. In this study, we propose the Context Based Dependency Network (CBDN), a method that is able to infer gene regulatory networks with the regulatory directions from gene expression data only. To determine the regulatory direction, CBDN computes the influence of the source on the target by evaluating the magnitude changes of expression dependencies between the target gene and the others when conditioning on the source gene. CBDN extends the data processing inequality by involving the dependency direction to distinguish between direct and transitive relationships between genes. We also define two types of important regulators which can influence a majority of the genes in the network directly or indirectly. CBDN can detect both types of important regulators by averaging the influence functions of a candidate regulator on the other genes. In our experiments with simulated and real data, even with the regulatory direction taken into account, CBDN outperforms the state-of-the-art approaches for inferring gene regulatory networks. CBDN identifies the important regulators in the predicted network: (1) TYROBP influences a batch of genes that are related to Alzheimer’s disease; (2) ZNF329 and RB1 significantly regulate the ‘mesenchymal’ gene expression signature genes for brain tumors. By merely leveraging gene expression data, CBDN can efficiently infer the existence of gene-gene interactions as well as their regulatory directions. The constructed networks are helpful in the identification of important regulators for complex diseases.
Adenosine-to-inosine RNA editing can markedly diversify the transcriptome, leading to a variety of critical molecular and biological processes in mammals. Over the past several years, researchers have developed several new pipelines and software packages to identify RNA editing sites with a focus on downstream statistical analysis and functional interpretation. Here, we developed a user-friendly public webserver named MIRIA that integrates statistics and visualization techniques to facilitate the comprehensive analysis of RNA editing site data identified by these pipelines and software packages. MIRIA is unique in that it provides several analytical functions, including RNA editing type statistics, genomic feature annotations, editing level statistics, genome-wide distribution of RNA editing sites, tissue-specific analysis, and conservation analysis. We collected high-throughput RNA sequencing (RNA-seq) data from eight tissues across seven species as the experimental data for MIRIA and constructed an example result page. MIRIA provides both visualization and analysis of mammalian RNA editing data for experimental biologists who are interested in revealing the functions of RNA editing sites. MIRIA is freely available at https://mammal.deepomics.org.