Structural variations (SVs) are widespread in human genomes and may have important implications in disease-related and evolutionary studies. High-throughput sequencing (HTS) has become a major platform for SV detection, and simulation serves as a powerful and cost-effective approach for benchmarking SV detection algorithms. Accurate performance assessment by simulation requires a simulator capable of generating data with all important features of real data, such as GC biases in HTS data and various complexities in tumor data. However, no available package has systematically addressed all issues in data simulation for SV benchmarking. Pysim-sv is a package for simulating HTS data to evaluate the performance of SV detection algorithms. Pysim-sv can introduce a wide spectrum of germline and somatic genomic variations. The package contains functionalities to simulate tumor data with aneuploidy and heterogeneous subclones, which is very useful in assessing algorithm performance in tumor studies. Furthermore, Pysim-sv can introduce GC-bias, the most important and prevalent bias in HTS data, in the simulated HTS data. Pysim-sv provides an unbiased toolkit for evaluating HTS-based SV detection algorithms.
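As a rough illustration of what introducing GC-bias into simulated HTS data can mean, the sketch below rejection-samples fragment start positions so that fragments with extreme GC content are drawn less often. This is not Pysim-sv's implementation: the bell-shaped curve, its `peak` and `width` parameters, and all function names are assumptions made for this example only.

```python
import math
import random


def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def acceptance_probability(gc, peak=0.5, width=0.15):
    """Hypothetical bell-shaped GC-bias curve (an assumption, not a fitted model).

    Returns 1.0 at gc == peak and falls off as GC content moves away from it.
    """
    return math.exp(-((gc - peak) ** 2) / (2 * width ** 2))


def sample_fragment_starts(genome, n_fragments, frag_len, rng):
    """Rejection-sample fragment start positions under the GC-bias curve."""
    if frag_len > len(genome):
        raise ValueError("fragment length exceeds genome length")
    starts = []
    while len(starts) < n_fragments:
        start = rng.randrange(len(genome) - frag_len + 1)
        fragment = genome[start:start + frag_len]
        # Keep the candidate with probability given by its GC content.
        if rng.random() < acceptance_probability(gc_fraction(fragment)):
            starts.append(start)
    return starts


# Toy usage on a randomly generated (synthetic) genome.
rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(500))
toy_starts = sample_fragment_starts(toy_genome, n_fragments=50, frag_len=20, rng=rng)
```

In a real simulator the acceptance curve would be estimated from observed coverage rather than chosen by hand.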
With the development of next-generation sequencing (NGS) technologies, a large amount of short-read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and may not be alignable. Sequence signature-based methods for genome comparison, based on the frequencies of word patterns in genomes and metagenomes, can potentially be useful for the analysis of short-read data from NGS. Here we review recent developments in alignment-free genome and metagenome comparison based on the frequencies of word patterns, with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related, and the applications of these measures to NGS data.
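To make the idea of comparing sequences by word-pattern frequencies concrete, the following minimal sketch counts overlapping words of length k (k-mers) and computes the inner product of the two count vectors, commonly called the D2 statistic. The cosine-style normalization into a dissimilarity between 0 and 1 is one simple choice made here for illustration and is not claimed to be the specific measure the review recommends. Function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping words of length k, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def d2_statistic(x, y):
    """Inner product of two word-count vectors (missing words count as zero)."""
    return sum(count * y[word] for word, count in x.items())


def cosine_dissimilarity(x, y):
    """Map the normalized inner product to a dissimilarity in [0, 1]."""
    norm_x = sqrt(sum(c * c for c in x.values()))
    norm_y = sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("both sequences must contain at least one valid word")
    return 0.5 * (1 - d2_statistic(x, y) / (norm_x * norm_y))


# Toy usage on two short made-up sequences.
a = kmer_counts("ACGTACGT", 2)
b = kmer_counts("ACGTTCGA", 2)
print(d2_statistic(a, b), cosine_dissimilarity(a, b))
```

Because only word counts are needed, the same functions can be applied to pooled counts from unassembled reads, which is what makes this family of measures attractive for NGS data.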
The tiling array is a type of Affymetrix short oligonucleotide microarray used to map transcription factor binding sites (TFBS) at high resolution. To estimate the quantity of the probes' target sequences, we model the binding behavior between probes and DNA sequences at the level of molecular interaction and term it the hybridization model for tiling array analysis (HMT). In our model, the binding behavior is affected by two main factors: the concentration of DNA fragments and the binding affinity of a probe for its target and off-target sequences. HMT mainly models the influence of binding affinity and effectively estimates the relative abundance of short DNA fragments. A comparison with another popular method suggests that HMT characterizes the hybridization mechanism of tiling arrays well. Another advantage of our model is that it is based on only one array and can therefore be easily extended to multiple arrays as well as to other oligonucleotide array platforms.
Multifocal tumors, developed either as independent tumors or as intrahepatic metastases, are very common in primary liver cancer. However, their molecular pathogenesis remains elusive. Herein, a patient with two synchronous hepatocellular carcinomas (HCC, designated as HCC-A and HCC-B) and one intrahepatic cholangiocarcinoma (ICC), as well as two postoperative recurrent tumors, was enrolled. Multiregional whole-exome sequencing was applied to these tumors to delineate their clonality and heterogeneity. The three primary tumors showed almost no overlap in mutations and copy number variations. Within each tumor, multiregional sequencing data showed varied intratumoral heterogeneity (21.6% in HCC-A, 20.4% in HCC-B, 53.2% in ICC). The mutational profiles of the two recurrent tumors showed obvious similarity with HCC-A (86.7% and 86.6%, respectively) rather than with the others, indicating that they originated from HCC-A. The evolutionary history of the two recurrent tumors indicated that intrahepatic micro-metastasis could be an early event during HCC progression. Notably, FAT4 was the only gene mutated in the two primary HCCs and the recurrences. A mutation prevalence screen and functional experiments showed that FAT4, harboring somatic coding mutations in 26.7% of HCC, could potently inhibit growth and invasion of HCC cells. In HCC patients, both FAT4 expression and FAT4 mutational status significantly correlated with patient prognosis. Together, our findings suggest that spatial and temporal dissection of genomic alterations during the progression of multifocal liver cancer may help to elucidate the basis for its dismal prognosis. FAT4 acts as a putative tumor suppressor that is frequently inactivated in human HCC.
Identification of cancer-related genes is helpful for understanding the pathogenesis of cancer, developing targeted drugs and creating new diagnostic and therapeutic methods. Given the complexity of laboratory-based methods and the increasing availability of high-throughput data, many network-based methods have been proposed to identify cancer-related genes from a global perspective. Some studies have focused on tissue-specific cancer networks. However, cancers from different tissues may share common features, and those methods may ignore the differences and similarities across cancers during model construction. In this work, in order to make full use of the global information of the network, we first establish a pan-cancer network via a differential network algorithm; the network contains heterogeneous data not only across multiple cancer types but also between tumor samples and normal samples. Second, the node representation vectors are learned by network embedding. In contrast to ranking analysis-based methods, with the help of integrative network analysis, we transform the cancer-related gene identification problem into a binary classification problem. The final results are obtained via ensemble classification. We further applied these methods to the most commonly used gene expression data involving six tissue-specific cancer types. As a result, an integrative pan-cancer network and several biologically meaningful results were obtained. For example, nine genes were ultimately identified as potential pan-cancer-related genes. Most of these genes have been reported in published studies, thus showing our method’s potential for application in identifying driver gene candidates for further biological experimental verification.
Long non-coding RNAs (LncRNAs) are usually longer than 200 nucleotides and have four functions in cells: signal, decoy, guide and scaffold. Many studies have shown that the expression of LncRNAs differs between normal tissues and cancers, including hepatocellular carcinoma (HCC), bladder cancer, epithelial ovarian cancer, gastric cancer and other cancers, which suggests that alterations in the expression of LncRNAs could promote or inhibit tumor growth. However, the exact mechanisms by which LncRNAs play their roles in normal and tumor cells remain unclear. This article reviews the role of LncRNAs in some common human carcinomas, especially HCC.
Etiologic diagnoses of lower respiratory tract infections (LRTI) have relied primarily on bacterial cultures, which often fail to return useful results in time. Although DNA-based assays are more sensitive than bacterial cultures in detecting pathogens, the molecular results are often inconsistent and are challenged by doubts about false positives, such as those due to system- and environment-derived contamination. Here we report a nationwide cohort study of 2986 suspected LRTI patients across P. R. China. We compared the performance of a DNA-based assay, qLAMP (quantitative Loop-mediated isothermal AMPlification), with that of standard bacterial cultures in detecting a panel of eight common respiratory bacterial pathogens from sputum samples. Our qLAMP assay detected the panel of pathogens in 1047 (69.28%) of the 1533 patients who ultimately qualified. We found that the bacterial titer quantified by qLAMP is a predictor of the probability that the bacterium in the sample can be detected by the culture assay. The relationship between the two assays fits a logistic regression curve. We used a piecewise linear function to define breakpoints where a latent pathogen abruptly changes its competitive relationship with the others in the panel. These breakpoints, where pathogens start to propagate abnormally, are used as cutoffs to eliminate the influence of contamination from normal flora. With the help of the cutoffs derived from statistical analysis, we were able to identify causative pathogens in 750 (48.92%) of the qualified patients. In conclusion, qLAMP is a reliable method for quantifying bacterial titer. Although latent bacteria always contaminate sputum samples, we can identify causative pathogens based on cutoffs derived from statistical analysis of competitive relationships. Trial Registration ClinicalTrials.gov NCT00567827
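The statement that titer predicts the probability of a positive culture through a logistic curve can be sketched as follows. This is a minimal illustration, not the study's analysis: the titer values and culture outcomes are synthetic numbers invented for the example, the use of log10 titer as the predictor is an assumption, and the plain gradient-ascent fit and the function names are choices made here for self-containment.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood.

    The predictor is centered internally to improve conditioning; the returned
    coefficients refer to the original, uncentered scale.
    """
    n = len(x)
    mean_x = sum(x) / n
    centered = [xi - mean_x for xi in x]
    a0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(centered, y):
            err = yi - sigmoid(a0 + b1 * xi)
            g0 += err
            g1 += err * xi
        a0 += lr * g0 / n
        b1 += lr * g1 / n
    return a0 - b1 * mean_x, b1


def culture_probability(titer, b0, b1):
    """Predicted probability of a positive culture at a given (log10) titer."""
    return sigmoid(b0 + b1 * titer)


# SYNTHETIC data for illustration only: log10 titer and culture result (1 = positive).
log_titer = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7]
culture = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(log_titer, culture)
```

A positive fitted slope `b1` corresponds to the abstract's observation that higher titers make culture detection more likely; the breakpoint-based cutoffs described in the abstract are a separate step not shown here.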