There exist few databases that enable cross-referencing among the various research fields related to bioenergy. Cross-referencing is highly desirable among bioinformatics databases related to environment, energy, and agriculture to improve mutual cooperation. By uniting Semantic Graphs, we can economically construct a distributed database, regardless of the size of research laboratories and research endeavors. Our purpose is to design and develop a workflow based on RDF (Resource Description Framework) that generates a Semantic Graph for a set of technical terms extracted from documents of various formats, such as PDF, HTML, and plain text. We attempt to generate the Semantic Graph as a result of text mining, including morphological analysis and syntax analysis. We have developed a prototype workflow program named "RDF Curator". By using this system, various types of documents can be automatically converted into RDF. "RDF Curator" is composed of general tools and libraries, so no special environment is needed. Hence, "RDF Curator" can be used on many platforms, such as Mac OS X, Linux, and Windows (Cygwin). We expect that our system can assist human curators in constructing Semantic Graphs. Although it is fast and offers high throughput, the accuracy of the present version of "RDF Curator" is lower than that of human curators. As a future task, we have to improve the accuracy of the workflow. In addition, we plan to apply our system to the analysis of network similarity.
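As a minimal sketch of the final step of such a workflow, the following Python code serializes pairs of extracted terms as RDF in N-Triples syntax. It is not the "RDF Curator" implementation: the namespace `http://example.org/term/`, the predicate `http://example.org/relatedTo`, and the example term pairs are hypothetical and chosen only for illustration, and the term extraction itself (morphological and syntax analysis) is assumed to have been done already.

```python
from urllib.parse import quote

# Hypothetical namespace and predicate, used only for illustration.
BASE = "http://example.org/term/"
RELATED = "http://example.org/relatedTo"


def term_iri(term: str) -> str:
    """Turn a technical term into an IRI under the hypothetical namespace."""
    slug = term.strip().lower().replace(" ", "_")
    return f"<{BASE}{quote(slug)}>"


def to_ntriples(pairs) -> str:
    """Serialize (subject_term, object_term) pairs as N-Triples lines."""
    return "\n".join(
        f"{term_iri(s)} <{RELATED}> {term_iri(o)} ." for s, o in pairs
    )


# Toy term pairs standing in for the output of text mining (assumption).
pairs = [("bioenergy", "biomass"), ("biomass", "cellulose")]
ntriples = to_ntriples(pairs)
print(ntriples)
```

Each output line is one subject-predicate-object triple, so graphs produced from different documents can be merged by simple concatenation before being loaded into a triple store.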
Researchers in agriculture, life science, and drug design need to acquire information that combines two or more life science databases for problem solving. Semantic Web technologies are already necessary for data integration between those databases. This study introduces a technique for utilizing RDF (Resource Description Framework) and OWL (Web Ontology Language) as a data set for the development of a machine learning predictor of interactomics. In addition, we outline a method for implementing interactomics LOD (Linked Open Data) in a graph database queried with SPARQL (SPARQL Protocol and RDF Query Language). Since 2013, the interactomics LOD has included pairs of protein-protein interactions of tyrosine kinases, pairs of amino acid residues of sugar (carbohydrate) binding proteins, and cross-reference information for protein chains among the entries of major bioscience databases. Finally, we designed three RDF schema models and made them accessible using AllegroGraph 4.11 and Virtuoso 7. The total number of triples in these databases was 1,824,859,745. They could be combined and searched together with public LOD of the life science domain comprising 28,529,064,366 triples. This research showed that it is realistic to handle large-scale LOD on a comparatively small budget. Using LOD reduced not only expense but also development time. In particular, RDF-SIFTS (Structure Integration with Function, Taxonomy and Sequence), an aggregate of 10 small LOD sets, was constructed within the short period of BioHackathon 2013, that is, in one week. We can therefore say that a data set required for machine learning of interactomics can be obtained quickly by using LOD. We set up the interactomics LOD as a database for application development. The SPARQL endpoints of these databases are published on the portal site UTProt (The University of Tokyo Protein, http://utprot.net).
In recent years, there has been international progress in developing platforms that support the reproducibility and reusability of research data. Typical platforms adopt a service architecture integrating multiple information systems to cover the entire research data lifecycle. In realizing this architecture, specifications for inheriting processes and results executed on different information systems play an essential role. This study introduces our practices for application profile development using ontology technology in the NII Research Data Cloud.
Cervical lymph node metastasis is an important prognostic factor in oral squamous cell carcinoma (OSCC), and preoperative evaluation of cervical lymph nodes requires high diagnostic accuracy. We investigated the usefulness of FDG-PET/contrast-enhanced CT for diagnosing cervical lymph node metastasis in OSCC and determined which procedures could be additionally performed to improve diagnostic accuracy. Between April 2005 and March 2013, a total of 115 patients with OSCC who were treated in the Department of Oral and Maxillofacial Surgery, Dokkyo Medical University Hospital participated in this study. The primary sites of OSCC were the tongue (n = 66), mandibular gingiva (n = 27), maxillary gingiva (n = 10), floor of the mouth (n = 6), and buccal mucosa (n = 6). The clinical stage of the disease was stage I in 10 cases, stage II in 35 cases, stage III in 17 cases, and stage IV in 53 cases. Uptake of FDG was elevated in the cervical lymph nodes of 48 patients, among whom 45 had cervical metastasis (true-positive) and 3 did not (false-positive). Among the 67 patients who did not have elevated FDG uptake, 8 had cervical metastasis (false-negative) and 59 did not (true-negative). The sensitivity, specificity, and accuracy of FDG-PET at a threshold SUVmax of 2.0 were 84.9%, 95.2%, and 90.4%, respectively. Re-evaluating patients with negative FDG-PET/contrast-enhanced CT findings by palpation and MRI increased the diagnostic accuracy to 93.6%, the sensitivity to 94.5%, and the specificity to 94.1%.
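The sensitivity, specificity, and accuracy reported for FDG-PET at a threshold SUVmax of 2.0 follow directly from the four counts given above. The short Python sketch below reproduces that arithmetic using the standard definitions of the three measures; the variable names are ours.

```python
# Counts reported in the abstract for FDG-PET (threshold SUVmax 2.0).
tp, fp = 45, 3    # elevated uptake: with / without cervical metastasis
fn, tn = 8, 59    # no elevated uptake: with / without cervical metastasis

total = tp + fp + fn + tn            # 115 patients
sensitivity = tp / (tp + fn)         # metastatic cases that were detected
specificity = tn / (tn + fp)         # non-metastatic cases correctly cleared
accuracy = (tp + tn) / total         # all correct calls over all patients

print(f"sensitivity = {sensitivity:.1%}")  # 84.9%
print(f"specificity = {specificity:.1%}")  # 95.2%
print(f"accuracy    = {accuracy:.1%}")     # 90.4%
```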
Real-world data (RWD) have been increasingly used for regulatory decision-making and as a control group for new drug approval applications. RWD are also helpful in understanding information such as risk factors (e.g., pre-existing medical conditions, personal protective equipment, travel, contacts, smoking, and exposure to animals) and vaccination status for the coronavirus disease 2019 (COVID-19). The methodology of utilizing RWD is inconsistent across healthcare institutions. However, there are possible solutions to standardize RWD for clinical data use, which include the use of Clinical Data Interchange Standards Consortium (CDISC) standards, tools, and concepts. This study examines the availability of CDISC and other international standards for the utilization of RWD with concrete examples and presents a potential platform for implementation. We consider a currently available solution that temporarily converts clinical data-warehouse (DWH) data into the Fast Healthcare Interoperability Resources (FHIR) format to comply with the CDISC standard. As an interim solution until FHIR is supported, this approach would allow institution-level standards to be converted to national standards and national standards to be mapped to international standards. We believe that the ideal research environment is a data platform that complies with national and international regulations related to RWD applications. Within such a platform, users can share data freely, rather than rely on a specific facility or vendor. Data platform developments are progressing in Japan and globally. In Japan, initiatives to use research data on research data platforms are under way. We are experimenting with implementing tools and knowledge shared by CDISC.
Receptor tyrosine kinases are essential proteins involved in cellular differentiation and proliferation in vivo and are heavily involved in allergic diseases, diabetes, and the onset and proliferation of cancerous cells. Identifying the interacting partner of such a protein, a growth factor ligand, will provide a deeper understanding of cellular proliferation, differentiation, and other cell processes. In this study, we developed a method for predicting tyrosine kinase ligand-receptor pairs from their amino acid sequences. We collected tyrosine kinase ligand-receptor pairs from the Database of Interacting Proteins (DIP) and UniProtKB, filtered them to remove sequence redundancy, and used them as a dataset for machine learning and for assessing predictive performance. Our prediction method is based on support vector machines (SVMs). We evaluated several input features suited to tyrosine kinases and compared and analyzed the results. Using sequence pattern information and domain information extracted from the sequences as input features, we obtained an area under the receiver operating characteristic curve (AUC) of 0.996. This performance is higher than that obtained from general protein-protein interaction pair predictions.
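To make the reported measure concrete, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive pair receives a higher classifier score than a randomly chosen negative pair, with ties counted as one half. The Python sketch below computes it that way. The labels and scores are toy values chosen for illustration, not data from this study, and the function name `roc_auc` is ours.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (interacting pair) or 0 (non-interacting pair)
    scores: classifier decision values, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 3 positives, 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly. An AUC of 0.996, as reported above, means that almost every positive pair is scored above almost every negative pair.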