Abstract Background Although homeobox genes have been the subject of many studies, little is known about the main amino acid changes that occurred early in the evolution of genes belonging to different classes. Results In this study, we report a method for the fast and efficient retrieval of sequences belonging to the ANTP (HOXL and NKL) and PRD classes. Furthermore, we look for diagnostic amino acid residues that can be used to distinguish HOXL, NKL and PRD genes. Conclusion The reported protein features will facilitate the robust classification of homeobox genes from newly sequenced bilaterian genomes. Nevertheless, in non-bilaterian genomes our findings must be cautiously applied. In principle, as long as a good manually curated data set is available the approach here described can be applied to non-bilaterian organisms as well. Our results help focus experimental studies onto investigating the biochemical functions of key homeodomain residues in different gene classes.
Abstract The International Cancer Genome Consortium (ICGC)’s Pan-Cancer Analysis of Whole Genomes (PCAWG) project aimed to categorize somatic and germline variations in both coding and non-coding regions in over 2,800 cancer patients. To provide this dataset to the research working groups for downstream analysis, the PCAWG Technical Working Group marshalled ~800TB of sequencing data from distributed geographical locations; developed portable software for uniform alignment, variant calling, artifact filtering and variant merging; performed the analysis in a geographically and technologically disparate collection of compute environments; and disseminated high-quality validated consensus variants to the working groups. The PCAWG dataset has been mirrored to multiple repositories and can be located using the ICGC Data Portal. The PCAWG workflows are also available as Docker images through Dockstore enabling researchers to replicate our analysis on their own data.
It has been recognized that the development of new therapeutic drugs is a complex and expensive process. A large number of factors affect the activity in vivo of putative candidate molecules and the propensity for causing adverse and toxic effects is recognized as one of the major hurdles behind the current "target-rich, lead-poor" scenario. Structure-Activity Relationship (SAR) studies, using relational Machine Learning (ML) algorithms, have already been shown to be very useful in the complex process of rational drug design. Despite the ML successes, human expertise is still of the utmost importance in the drug development process. An iterative process and tight integration between the models developed by ML algorithms and the know-how of medicinal chemistry experts would be a very useful symbiotic approach. In this paper we describe a software tool that achieves that goal--iLogCHEM. The tool allows the use of Relational Learners in the task of identifying molecules or molecular fragments with potential to produce toxic effects, and thus help in stream-lining drug design in silico. It also allows the expert to guide the search for useful molecules without the need to know the details of the algorithms used. The models produced by the algorithms may be visualized using a graphical interface, that is of common use amongst researchers in structural biology and medicinal chemistry. The graphical interface enables the expert to provide feedback to the learning system. The developed tool has also facilities to handle the similarity bias typical of large chemical databases. For that purpose the user can filter out similar compounds when assembling a data set. Additionally, we propose ways of providing background knowledge for Relational Learners using the results of Graph Mining algorithms.
Abstract Catharanthus roseus leaves produce a range of monoterpenoid indole alkaloids (MIAs) that include low levels of the anticancer drugs vinblastine and vincristine. The MIA pathway displays a complex architecture spanning different subcellular and cell-type localizations and is under complex regulation. As a result, the development of strategies to increase the levels of the anticancer MIAs has remained elusive. The pathway involves mesophyll specialised idioblasts where the late unsolved biosynthetic steps are thought to occur. Here, protoplasts of C. roseus leaf idioblasts were isolated by fluorescence-activated cell sorting, and their differential alkaloid and transcriptomic profiles were characterised. This involved the assembly of an improved C. roseus transcriptome from short- and long-read data, IDIO+. It was observed that C. roseus mesophyll idioblasts possess a distinctive transcriptomic profile associated with protection against biotic and abiotic stresses, and indicative that this cell type is a carbon sink, in contrast with surrounding mesophyll cells. Moreover, it is shown that idioblasts are a hotspot of alkaloid accumulation, suggesting that their transcriptome may hold the keys to the in-depth understanding of the MIA pathway and the success of strategies leading to higher levels of the anticancer drugs. Highlight Catharanthus mesophyll idioblasts are a hotspot of anticancer alkaloid accumulation. The idioblast transcriptome reveals commitment with stress responses and provides a roadmap towards the increase of anticancer alkaloid levels.
Summary Transposable Elements (TE) are sequences of DNA that move and transpose within a genome. TEs, as mutation agents, are quite important for their role in both genome alteration diseases and on species evolution. Several tools have been developed to discover and annotate TEs but no single tool achieves good results on all different types of TEs. In this paper we evaluate the performance of several TEs detection and annotation tools and investigate if Machine Learning techniques can be used to improve their overall detection accuracy. The results of an in silico evaluation of TEs detection and annotation tools indicate that their performance can be improved by using machine learning constructed classifiers.