The International Cancer Genome Consortium (ICGC)'s Pan-Cancer Analysis of Whole Genomes (PCAWG) project aimed to categorize somatic and germline variations in both coding and non-coding regions in over 2,800 cancer patients. To provide this dataset to the research working groups for downstream analysis, the PCAWG Technical Working Group marshalled ~800 TB of sequencing data from distributed geographical locations; developed portable software for uniform alignment, variant calling, artifact filtering and variant merging; performed the analysis in a geographically and technologically disparate collection of compute environments; and disseminated high-quality, validated consensus variants to the working groups. The PCAWG dataset has been mirrored to multiple repositories and can be located using the ICGC Data Portal. The PCAWG workflows are also available as Docker images through Dockstore, enabling researchers to replicate our analysis on their own data.
Data commons collate data with cloud computing infrastructure and commonly used software services, tools and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize and share large-scale genomic datasets. Data ecosystems can be built by interoperating multiple data commons. Curating, importing and analyzing the data in a data commons can be quite labor-intensive. Data lakes offer an alternative to data commons: they simply provide access to data, deferring curation and analysis and delegating them to those who access the data. We review software platforms for managing, analyzing and sharing genomic data, with an emphasis on data commons, but also covering data ecosystems and data lakes.
The majority of pharmacogenomic (PGx) studies have been conducted on European ancestry populations, thereby excluding minority populations and impeding the discovery and translation of African American–specific genetic variation into precision medicine. Without accounting for variants found in African Americans, clinical recommendations based solely on genetic biomarkers found in European populations could result in misclassification of drug response in African American patients. To address these challenges, we formed the Transdisciplinary Collaborative Center (TCC), African American Cardiovascular Pharmacogenetic Consortium (ACCOuNT), to discover novel genetic variants in African Americans related to clinically actionable cardiovascular phenotypes and to incorporate African American–specific sequence variations into clinical recommendations at the point of care. The TCC consists of two research projects focused on discovery and translation of genetic findings and four cores that support the projects. In addition, the largest repository of PGx information on African Americans is being established, as well as lasting infrastructure that can be utilized to spur continued research in this understudied population.
The Earth Observing One (EO-1) satellite was launched in November 2000 as a one-year technology demonstration mission for a variety of space technologies. After the first year, it was used as a pathfinder for the creation of SensorWebs. A SensorWeb is the integration of a variety of space, airborne and ground sensors into a loosely coupled collaborative sensor system that automatically provides useful data products. Typically, a SensorWeb comprises heterogeneous sensors tied together with an open messaging architecture and web services. SensorWebs provide easier access to sensor data, automated data product production and rapid data product delivery. Disasters are an ideal arena in which to test SensorWeb functionality, since emergency workers and managers need easy and rapid access to satellite, airborne and in-situ sensor data as decision support tools. The Namibia Early Flood Warning SensorWeb pilot project was established to experiment with various aspects of sensor interoperability and SensorWeb functionality. The SensorWeb system features EO-1 data along with data sets from other satellites such as Radarsat, Terra and Aqua. Finally, the SensorWeb team began to examine how to measure the economic impact of SensorWeb technology infusion. This paper describes the architecture and software components that were developed, along with the performance improvements that were experienced. Problems and challenges that were encountered are also described, along with a vision for future enhancements to mitigate some of them.
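The loosely coupled, message-driven design can be pictured with a toy sketch. The following Python snippet is purely illustrative and assumes hypothetical sensor names and a flood-gauge data product; a real SensorWeb uses open web-service messaging standards rather than an in-process queue.

```python
import json
import queue
import threading
import time

# Toy in-process "message bus"; a real SensorWeb uses open web-service
# messaging rather than a Python queue, but the loose coupling is the same idea.
bus = queue.Queue()

def sensor(name, kind, read_fn, period_s=0.1, count=3):
    """Publish observations from one heterogeneous sensor as self-describing messages."""
    for _ in range(count):
        bus.put(json.dumps({"sensor": name, "kind": kind,
                            "time": time.time(), "value": read_fn()}))
        time.sleep(period_s)

def build_product(expected):
    """Consume messages from any sensor and assemble a simple flood-warning data product."""
    readings = [json.loads(bus.get()) for _ in range(expected)]
    gauges = [r["value"] for r in readings if r["kind"] == "river_gauge_m"]
    return {"max_gauge_m": max(gauges, default=None), "n_observations": len(readings)}

# Hypothetical sensors: a satellite-derived flood extent and an in-situ river gauge.
threads = [
    threading.Thread(target=sensor, args=("EO-1_ALI", "flood_extent_km2", lambda: 12.5)),
    threading.Thread(target=sensor, args=("gauge_07", "river_gauge_m", lambda: 3.2)),
]
for t in threads:
    t.start()
print(build_product(expected=6))
for t in threads:
    t.join()
```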
Suppose that a large number of parameterized trajectories $\gamma$ of a dynamical system evolving in $\mathbb{R}^N$ are stored in a database. Let $\eta \subset \mathbb{R}^N$ denote a parameterized path in Euclidean space, and let $\|\cdot\|$ denote a norm on the space of paths. Data structures and indices for trajectories are defined, and algorithms are given to answer queries of the following forms: Query 1. Given a path $\eta$, determine whether $\eta$ occurs as a subtrajectory of any trajectory $\gamma$ from the database. If so, return the trajectory; otherwise, return null. Query 2. Given a path $\eta$, return the trajectory $\gamma$ from the database which minimizes the norm $\|\eta - \gamma\|$.
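As a concrete reading of the two queries, here is a minimal Python sketch. It assumes trajectories are stored as uniformly sampled NumPy arrays and uses the sup norm over sample points as $\|\cdot\|$; the class and method names are illustrative and are not the paper's data structures or indices, which would avoid these linear scans.

```python
import numpy as np

class TrajectoryDB:
    """Naive in-memory trajectory store (illustrative; no real indexing)."""

    def __init__(self, trajectories):
        # Each trajectory gamma is a (T_i, N) array of samples in R^N.
        self.trajectories = [np.asarray(g, dtype=float) for g in trajectories]

    @staticmethod
    def _path_norm(a, b):
        # Sup norm over sample points: max_t ||a(t) - b(t)||_2.
        return float(np.max(np.linalg.norm(a - b, axis=1)))

    def query1(self, eta, tol=1e-9):
        """Query 1: return a trajectory containing eta as a subtrajectory, else None."""
        eta = np.asarray(eta, dtype=float)
        m = len(eta)
        for gamma in self.trajectories:
            for start in range(len(gamma) - m + 1):
                if self._path_norm(gamma[start:start + m], eta) <= tol:
                    return gamma
        return None

    def query2(self, eta):
        """Query 2: return the trajectory minimizing ||eta - gamma||
        (restricted here to trajectories sampled at the same length as eta)."""
        eta = np.asarray(eta, dtype=float)
        candidates = [g for g in self.trajectories if len(g) == len(eta)]
        return min(candidates, key=lambda g: self._path_norm(g, eta), default=None)

# Usage: 50 random-walk trajectories in R^3, then a nearest-trajectory query.
db = TrajectoryDB([np.cumsum(np.random.randn(200, 3), axis=0) for _ in range(50)])
closest = db.query2(np.cumsum(np.random.randn(200, 3), axis=0))
```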
With the rise of advanced workflow languages for scientific computations, Nextflow has gained increased attention from the bioinformatics community. Nextflow offers native support for advanced parallelism, which can greatly enhance resource utilization and throughput. Still, a significant portion of bioinformatics workflows are developed with the Common Workflow Language (CWL). Transitioning from CWL to Nextflow poses a significant challenge due to differences in programming models, scripting-language compatibilities, and the need for in-depth knowledge of both languages. To address this challenge, we present CNT, a novel, semi-automated translator that converts CWL workflows into Nextflow workflows. At its core, CNT uses an automated translation mechanism that converts the CommandLineTool, the most basic unit of CWL, into Nextflow's Process class. This component integrates tool-level conversion, graph dependency analysis, and correctness checks to provide highly automated translation coverage, significantly reducing development time while satisfying language-specific requirements such as building a proper dataflow model when creating workflows. Furthermore, CNT incorporates a module for aiding manual translation: it can identify three common JavaScript patterns in CWL workflows, offering further guidance to developers during the translation phase. We evaluated CNT on production-grade workflows and found that it can cover up to 81% of the original workflows, substantially reducing development time. Additionally, transitioning from a cwltool-based system to Nextflow with CNT can result in a 72% speedup and an 85% increase in CPU utilization.
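To make the tool-level conversion concrete, here is a minimal Python sketch of the general idea, not CNT's actual implementation: it maps a drastically simplified CWL CommandLineTool onto a Nextflow process skeleton, ignoring input bindings, requirements, and the JavaScript expressions that CNT's manual-translation module targets. The example tool definition is hypothetical.

```python
import yaml  # CWL documents are YAML; requires PyYAML

def cwl_tool_to_nextflow(cwl_text):
    """Translate a drastically simplified CWL CommandLineTool into a Nextflow
    process definition. Illustrative sketch only: every input is emitted as a
    plain `val`, whereas a real translator maps CWL File inputs to path channels."""
    tool = yaml.safe_load(cwl_text)
    if tool.get("class") != "CommandLineTool":
        raise ValueError("expected a CommandLineTool document")

    name = str(tool.get("id", "translated_tool")).replace("-", "_")
    inputs = list(tool.get("inputs", {}))
    outputs = tool.get("outputs", {})
    base_cmd = tool.get("baseCommand", [])
    if isinstance(base_cmd, str):
        base_cmd = [base_cmd]

    lines = [f"process {name} {{", "  input:"]
    lines += [f"    val {i}" for i in inputs]
    lines.append("  output:")
    lines += [f'    path "{o["outputBinding"]["glob"]}"' for o in outputs.values()]
    lines += ["  script:", '  """']
    lines.append("  " + " ".join(base_cmd + [f"${{{i}}}" for i in inputs]))
    lines += ['  """', "}"]
    return "\n".join(lines)

# Hypothetical CWL tool wrapping `bwa index`.
example = """
class: CommandLineTool
id: bwa-index
baseCommand: [bwa, index]
inputs:
  reference: File
outputs:
  index_files:
    outputBinding:
      glob: "*.bwt"
"""
print(cwl_tool_to_nextflow(example))
```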
We present Global-Local POS tagging, a framework for training generative stochastic Part-of-Speech models on large corpora. Global Taggers offer several advantages over their counterparts trained on small, curated corpora, including the ability to automatically extend and update their models with new text. Global Taggers also avoid a fundamental limitation of current models, whose performance relies heavily on curated text with manually assigned labels. We illustrate our approach by training several Global Taggers, implemented with generative stochastic models, on two large corpora using a high-performance computing architecture. We further demonstrate that Global Taggers can be improved by incorporating models trained on curated text, called Local Taggers, to obtain better tagging performance on specific topics.
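A minimal sketch of one way to combine a global and a local generative model, assuming tagged sentences are given as lists of (word, tag) pairs; the bigram HMM, add-one smoothing, and fixed interpolation weight below are illustrative stand-ins for the paper's generative stochastic models and training procedure.

```python
import math
from collections import defaultdict

class InterpolatedHMMTagger:
    """Bigram HMM tagger that linearly interpolates emission/transition estimates
    from a large 'global' corpus with those from a small curated 'local' corpus."""

    def __init__(self, global_sents, local_sents, lam=0.7):
        self.lam = lam  # weight on the global (large-corpus) model
        self.g_emit, self.g_trans, self.tags = self._counts(global_sents)
        self.l_emit, self.l_trans, local_tags = self._counts(local_sents)
        self.tags |= local_tags

    @staticmethod
    def _counts(sents):
        emit = defaultdict(lambda: defaultdict(int))   # tag  -> word -> count
        trans = defaultdict(lambda: defaultdict(int))  # prev -> tag  -> count
        tags = set()
        for sent in sents:
            prev = "<s>"
            for word, tag in sent:
                emit[tag][word.lower()] += 1
                trans[prev][tag] += 1
                tags.add(tag)
                prev = tag
        return emit, trans, tags

    def _prob(self, global_table, local_table, ctx, x, vocab=10_000):
        def p(table):  # add-one smoothing against a nominal vocabulary size
            total = sum(table[ctx].values())
            return (table[ctx][x] + 1) / (total + vocab)
        return self.lam * p(global_table) + (1 - self.lam) * p(local_table)

    def tag(self, words):
        """Viterbi decoding over the interpolated model."""
        emit = lambda t, w: math.log(self._prob(self.g_emit, self.l_emit, t, w.lower()))
        trans = lambda p, t: math.log(self._prob(self.g_trans, self.l_trans, p, t))
        V = [{t: trans("<s>", t) + emit(t, words[0]) for t in self.tags}]
        back = [{}]
        for i, w in enumerate(words[1:], start=1):
            V.append({}); back.append({})
            for t in self.tags:
                prev = max(self.tags, key=lambda p: V[i - 1][p] + trans(p, t))
                V[i][t] = V[i - 1][prev] + trans(prev, t) + emit(t, w)
                back[i][t] = prev
        seq = [max(self.tags, key=lambda t: V[-1][t])]
        for i in range(len(words) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))

# Usage with toy corpora (tags follow Penn Treebank conventions).
global_corpus = [[("rivers", "NNS"), ("flood", "VBP")], [("the", "DT"), ("flood", "NN")]]
local_corpus = [[("flood", "NN"), ("warnings", "NNS")]]
tagger = InterpolatedHMMTagger(global_corpus, local_corpus)
print(tagger.tag(["the", "flood"]))
```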