Whole-genome bisulfite sequencing (WGBS) is the standard method for profiling DNA methylation at single-nucleotide resolution. Different tools have been developed to extract differentially methylated regions (DMRs), often built upon assumptions from mammalian data. Here, we present MethylScore, a pipeline to analyse WGBS data and to account for the substantially more complex and variable nature of plant DNA methylation. MethylScore uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation. It processes data from genomic alignments to DMR output and is designed to be usable by novice and expert users alike. We show how MethylScore can identify DMRs from hundreds of samples and how its data-driven approach can stratify associated samples without prior information. We identify DMRs in the A. thaliana 1,001 Genomes dataset to unveil known and unknown genotype–epigenotype associations.
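To make the idea of segmenting a genome into high and low methylation states concrete, the following is a minimal sketch, not MethylScore's actual method. It assumes per-cytosine methylation rates between 0 and 1 and substitutes a simple one-dimensional two-means clustering for the unsupervised classifier; the function names `two_means` and `segment` and the toy data are hypothetical.

```python
from itertools import groupby


def two_means(values, iterations=50):
    """Cluster 1-D values into two groups; return (low_mean, high_mean)."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must not all be identical")
    for _ in range(iterations):
        # Assign every value to the nearer of the two current centres.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return low, high


def segment(positions, rates):
    """Merge consecutive sites sharing a state into (start, end, state) tuples."""
    low, high = two_means(rates)
    states = ["high" if abs(r - high) < abs(r - low) else "low" for r in rates]
    segments = []
    index = 0
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((positions[index], positions[index + length - 1], state))
        index += length
    return segments


# Toy data (assumption): genomic positions and methylation rates at each site.
toy_positions = [100, 105, 110, 200, 210, 220, 300]
toy_rates = [0.05, 0.10, 0.00, 0.90, 0.85, 0.95, 0.10]
toy_segments = segment(toy_positions, toy_rates)
```

This sketch ignores read coverage, sequence context and the dependence between neighbouring sites, all of which a real segmentation method would need to model.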
We have conducted a study on the long-term availability of bioinformatics Web services: an observation of 927 Web services published in the annual Nucleic Acids Research Web Server Issues between 2003 and 2009. We found that 72% of Web sites are still available at the published addresses, and only 9% of services are completely unavailable. Older addresses often redirect to new pages. We checked the functionality of all available services: for 33%, we could not test functionality because there was no example data or a related problem; 13% were truly no longer working as expected; we could positively confirm functionality only for 45% of all services. Additionally, we conducted a survey among 872 Web Server Issue corresponding authors; 274 replied. Of all respondents, 78% indicate that their services have been developed solely by students and researchers without a permanent position. Consequently, these services are in danger of falling into disrepair after the original developers move to another institution, and indeed, for 24% of services, there is no plan for maintenance, according to the respondents. We introduce a Web service quality scoring system that correlates with the number of citations: services with a high score are cited 1.8 times more often than low-scoring services. We have identified key characteristics that are predictive of a service's survival, providing reviewers, editors, and Web service developers with the means to assess or improve Web services. A Web service conforming to these criteria receives more citations and provides more reliable service for its users. The most effective way of ensuring continued access to a service is a persistent Web address, offered either by the publishing journal, or created on the authors' own initiative, for example at http://bioweb.me. The community would benefit the most from a policy requiring any source code needed to reproduce results to be deposited in a public repository.
Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10-12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
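The contig N50 and L50 figures quoted above follow a standard definition: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the assembly size, and L50 is the number of contigs needed to get there. A minimal sketch of that calculation, using made-up contig lengths rather than the ARS-UCD1.2 assembly (the function name `n50_l50` is hypothetical):

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in base pairs."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines both statistics.
        if running >= half_total:
            return length, count


# Toy lengths (assumption), not taken from any real assembly.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50_l50(toy_contigs)  # (70, 2)
```

A larger N50 together with a smaller L50 indicates that fewer, longer contigs cover half of the assembly, which is the sense in which an assembly is described as more continuous.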
Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, as those often found in cis-regulatory modules.
Results: We propose a new algorithm that combines the benefits of existing motif finding with those of support vector machines (SVMs) to find degenerate motifs and thereby improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets.
Availability: The Python source code (open source-licensed under GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/
Contact: sebi@tuebingen.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
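For readers unfamiliar with the position weight matrix (PWM) scoring that the Motivation refers to, the following is a minimal sketch of the baseline technique only. It is not the SVM-based algorithm proposed in this work. It assumes a uniform background base frequency of 0.25 and a pseudocount of 1; the function names and the toy sites are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        # Log-odds of the smoothed base frequency against the background.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    hits = [
        (sum(col[base] for col, base in zip(pwm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[0])


# Toy aligned sites and promoter fragment (assumptions for illustration).
toy_sites = ["TATA", "TATA", "TATT", "TACA"]
toy_pwm = build_pwm(toy_sites)
toy_score, toy_offset = best_hit(toy_pwm, "GGCTATAGG")
```

The sketch expects uppercase A, C, G and T only. Because each position is scored independently and a single best window is reported, it illustrates why such models handle one well-conserved site better than combinations of several degenerate sites.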
Background: Assessment of seed germination is an essential task for seed researchers to measure the quality and performance of seeds. Usually, seed assessments are done manually, which is a cumbersome, time-consuming and error-prone process. Classical image analysis methods are not well suited for large-scale germination experiments, because they often rely on manual adjustments of color-based thresholds. Here, we propose a machine learning approach using modern artificial neural networks with region proposals for accurate seed germination detection and high-throughput seed germination experiments.
Results: We generated labeled imaging data of the germination process of more than 2400 seeds for three different crops, Zea mays (maize), Secale cereale (rye) and Pennisetum glaucum (pearl millet), with a total of more than 23,000 images. Different state-of-the-art convolutional neural network (CNN) architectures with region proposals have been trained using transfer learning to automatically identify seeds within petri dishes and to predict whether the seeds germinated or not. On a hold-out test data set, our proposed models achieved a high mean average precision (mAP) of approximately 97.9%, 94.2% and 94.3% for Zea mays, Secale cereale and Pennisetum glaucum, respectively. Further, various single-value germination indices, such as Mean Germination Time and Germination Uncertainty, can be computed more accurately with the predictions of our proposed model than with manual counting.
Conclusion: Our proposed machine learning-based method can help to speed up the assessment of seed germination experiments for different seed cultivars. It has lower error rates and a higher performance compared to conventional and manual methods, leading to more accurate germination indices and quality assessments of seeds.
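The two germination indices named above can be computed from per-interval counts of newly germinated seeds. The sketch below assumes the commonly used definitions: Mean Germination Time as the count-weighted mean of the observation times, and Germination Uncertainty as the Shannon entropy (in bits) of the relative germination frequencies. The function name and the toy counts are hypothetical and are not taken from the study's data.

```python
import math


def germination_indices(times, newly_germinated):
    """Return (mean germination time, germination uncertainty in bits).

    times[i] is the observation time and newly_germinated[i] the number
    of seeds that germinated since the previous observation.
    """
    if len(times) != len(newly_germinated):
        raise ValueError("times and counts must have the same length")
    total = sum(newly_germinated)
    if total == 0:
        raise ValueError("no seed germinated")
    mean_time = sum(t * n for t, n in zip(times, newly_germinated)) / total
    # Entropy of the relative frequencies; empty intervals contribute nothing.
    uncertainty = -sum(
        (n / total) * math.log2(n / total) for n in newly_germinated if n > 0
    )
    return mean_time, uncertainty


# Toy experiment (assumption): observation days and newly germinated seeds.
toy_days = [1, 2, 3, 4]
toy_counts = [0, 5, 10, 5]
toy_mgt, toy_uncertainty = germination_indices(toy_days, toy_counts)  # (3.0, 1.5)
```

Both indices depend directly on the per-interval counts, so any miscounted seed shifts them, which is why more accurate germination detection leads to more accurate index values.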