Objectives: We compared the effects of two semantic terminology models on the classification of clinical notes through a study in the domain of heart murmur findings. Methods: One schema was established from the existing SNOMED CT model (S-Model) and the other from a template model (T-Model), which uses base concepts and non-hierarchical relationships to characterize the murmurs. A corpus of clinical notes (n=309) was collected and annotated using the two schemas. The coded annotations served as input to a decision tree classifier for a text classification task. The standard information retrieval measures of precision, recall, f-score, and accuracy, together with the paired t-test, were used for evaluation. Results: The performance of the S-Model was better than that of the original T-Model (p<0.05 for recall and f-score). A revised T-Model, with an extended structure and corresponding values, performed better than the S-Model (p<0.05 for recall and accuracy). Conclusion: We found that content coverage is a more important factor than the terminology model for classification; however, a template-style model facilitates content gap discovery and completion.

Introduction

While modern terminologies have advanced well beyond simple one-dimensional subsumption relationships through the introduction of composite expressions, there is an emerging convergence of approaches toward the use of a concept-based clinical terminology with an underlying formal semantic terminology model (STM) [1]. SNOMED CT, the most comprehensive clinically oriented medical terminology system, currently rests on a description logic (DL) foundation and uses the underlying DL-based structure to formally represent the meanings of concepts and the interrelationships between concepts [2-3]. The existing SNOMED CT model is mainly pre-coordination oriented, i.e. it contains many pre-coordinated terms, and it also supports post-coordination. For example, the compositional expression “[ hypophysectomy (52699005) ] + [ transfrontal approach (65519007) ]” can be used to describe a more specific clinical statement than one using only the term “hypophysectomy (52699005)”. For a specific domain, a template model having a semantic structure with a coherent class of terms can be used as a formal representation [4]. This kind of model is mainly post-coordination oriented: a list of atomic terms is organized within a semantic structure. For example, the latest version of the International Classification of Nursing Practice (ICNP) uses a 7-Axis model to support the representation of nursing concepts and integrates the domain concepts of nursing in a manner suitable for computer processing [5]. One of the main goals of semantic terminology models is to support the capture of structured clinical information, which is crucial for computer programs such as information retrieval systems and decision support tools [6]. Structured recording has the potential to improve information retrieval from a patient database in response to clinically relevant questions [1]. However, functional differences in retrieval performance between these two kinds of semantic terminology models have not been clearly demonstrated. In this study, we focus on the specific domain of heart murmur findings. Two schemas were established from two different semantic terminology models for evaluation: one schema is extracted from the existing SNOMED CT model (S-Model) and the other is a template model (T-Model) extracted from a concept-dependent attributes model recently published by Green et al. [7].
The objectives of this study are to annotate real clinical notes using the two schemas and to compare and evaluate the effects of the two models on classification of the clinical notes.

Methods and Materials

Defining the annotation schemas

We defined schemas for both the S-Model and the T-Model and represented them in Protégé (version 3.2 beta), an ontology editing environment developed by Stanford Medical Informatics [8]. For the S-Model, we established a schema by extracting concept trees from the existing sub-hierarchy of heart murmur findings in the January 2006 version of SNOMED CT (see Fig. 1). One root concept is “Heart murmur (SCTID_88610006)”, which includes 86 sub-concepts of pre-coordinated terms for heart murmur findings. The other root concept is “Anatomical concepts (SCTID_257728006)”, which includes two parts relevant to our schema. One part is the concept “Cardiac internal structure (SCTID_277712000)” and its sub-concepts. The other part contains only those anatomical concepts that appear in our clinical notes corpus, identified through a manual review. For all heart murmur concepts, two semantic attributes derive from the SNOMED CT context model for
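The classification step summarized in the abstract above (schema-coded annotations fed to a decision tree classifier and scored with precision, recall, f-score, and accuracy) can be illustrated with a minimal sketch. The attribute names, values, and murmur labels below are hypothetical placeholders rather than the actual S-Model or T-Model fields, and scikit-learn is assumed as the decision tree and metrics implementation.

# Minimal sketch: schema-coded annotations -> decision tree text classification.
# Attribute names, values, and labels are hypothetical, not the study's schemas.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# One dictionary of attribute/value pairs per annotated note.
annotations = [
    {"timing": "systolic", "location": "apex", "quality": "blowing"},
    {"timing": "diastolic", "location": "left sternal border", "quality": "rumbling"},
    {"timing": "systolic", "location": "apex", "quality": "harsh"},
    {"timing": "diastolic", "location": "apex", "quality": "rumbling"},
]
labels = ["murmur_A", "murmur_B", "murmur_A", "murmur_B"]  # hypothetical classes

# Encode attribute/value pairs as a binary feature matrix.
X = DictVectorizer(sparse=False).fit_transform(annotations)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_test, pred, average="macro", zero_division=0))
print("f-score:  ", f1_score(y_test, pred, average="macro", zero_division=0))
print("accuracy: ", accuracy_score(y_test, pred))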
Background: Today, there is an increasing need to centralize and standardize electronic health data within clinical research as the volume of data continues to balloon. Domain-specific common data elements (CDEs) are emerging as a standard approach to clinical research data capture and reporting. Recent efforts to standardize clinical study CDEs have been of great benefit in facilitating data integration and data sharing. The importance of the temporal dimension of clinical research studies has been well recognized; however, very few studies in the biomedical research community have focused on the formal representation of temporal constraints and temporal relationships within clinical research data. In particular, temporal information can be extremely powerful in enabling high-quality cancer research.
Semantic MEDLINE provides comprehensive resources with structured annotations that have the potential to facilitate translational studies in the biomedical domain. It is computationally challenging, however, to query the data directly in the current Semantic MEDLINE database. In this research, we propose a domain-pattern-driven approach to optimizing the organization and representation of Semantic MEDLINE data for translational science studies, using the Resource Description Framework (RDF) and Semantic Web technologies.
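As a rough illustration of the kind of organization being proposed, the sketch below stores a few Semantic MEDLINE-style predications as RDF triples and retrieves them with a SPARQL query using rdflib. The namespace, predicate names, and example triples are hypothetical and do not reflect the actual domain patterns developed in the study.

# Illustrative sketch: predications as RDF triples queried with SPARQL via rdflib.
# The namespace, predicates, and triples are hypothetical examples.
from rdflib import Graph, Namespace

SM = Namespace("http://example.org/semmed/")  # hypothetical namespace
g = Graph()

# A predication such as "drug X TREATS disease Y" stored as a direct triple.
g.add((SM["metformin"], SM["treats"], SM["type_2_diabetes"]))
g.add((SM["statin"], SM["treats"], SM["hyperlipidemia"]))

# SPARQL query: find all subjects asserted to treat a given disease.
query = """
PREFIX sm: <http://example.org/semmed/>
SELECT ?drug WHERE {
    ?drug sm:treats sm:type_2_diabetes .
}
"""
for row in g.query(query):
    print(row.drug)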
Electronic health records (EHRs) are increasingly used for clinical and translational research through the creation of phenotype algorithms. Currently, phenotype algorithms are most commonly represented as noncomputable descriptive documents and knowledge artifacts that detail the protocols for querying diagnoses, symptoms, procedures, medications, and/or text-driven medical concepts, and are primarily meant for human comprehension. We present desiderata for developing a computable phenotype representation model (PheRM).

A team of clinicians and informaticians reviewed common features for multisite phenotype algorithms published in PheKB.org and existing phenotype representation platforms. We also evaluated well-known diagnostic criteria and clinical decision-making guidelines to encompass a broader category of algorithms.

We propose 10 desired characteristics for a flexible, computable PheRM: (1) structure clinical data into queryable forms; (2) recommend use of a common data model, but also support customization for the variability and availability of EHR data among sites; (3) support both human-readable and computable representations of phenotype algorithms; (4) implement set operations and relational algebra for modeling phenotype algorithms; (5) represent phenotype criteria with structured rules; (6) support defining temporal relations between events; (7) use standardized terminologies and ontologies, and facilitate reuse of value sets; (8) define representations for text searching and natural language processing; (9) provide interfaces for external software algorithms; and (10) maintain backward compatibility.

A computable PheRM is needed for true phenotype portability and reliability across different EHR products and healthcare systems. These desiderata are a guide to inform the establishment and evolution of EHR phenotype algorithm authoring platforms and languages.
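To make desiderata (4), (5), and (7) concrete, the sketch below expresses a hypothetical phenotype criterion as a structured rule that combines reusable value sets with set operations over patient identifiers. The value sets, codes, and patient records are invented examples, not part of any PheRM specification.

# Minimal sketch: phenotype criteria as structured rules plus set operations.
# Value sets, codes, and patient IDs are hypothetical examples.

# Reusable value sets of standardized codes (desideratum 7).
T2DM_ICD_CODES = {"E11.9", "E11.65"}   # hypothetical ICD-10 value set
METFORMIN_RXNORM = {"6809"}            # hypothetical RxNorm value set

# Queryable clinical data keyed by patient ID (desideratum 1).
diagnoses = {"p1": {"E11.9"}, "p2": {"I10"}, "p3": {"E11.65", "I10"}}
medications = {"p1": {"6809"}, "p3": {"6809"}, "p4": {"6809"}}

def patients_with(records, value_set):
    """Return the set of patients having at least one code in the value set."""
    return {pid for pid, codes in records.items() if codes & value_set}

# Structured rule expressed with set operations (intersection = logical AND).
case_patients = patients_with(diagnoses, T2DM_ICD_CODES) & patients_with(medications, METFORMIN_RXNORM)
print(case_patients)  # {'p1', 'p3'}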
Structured Product Labeling (SPL) is a document markup standard approved by Health Level Seven (HL7) and adopted by the United States Food and Drug Administration (FDA) as a mechanism for exchanging drug product information. SPL drug labels contain rich information about FDA-approved clinical drugs; however, the lack of linkage to standard drug ontologies hinders their meaningful use. NDF-RT (National Drug File Reference Terminology) and NLM RxNorm were used as standard drug ontologies to standardize and profile the product labels. In this paper, we present a framework that maps SPL drug labels to these existing drug ontologies: NDF-RT and RxNorm. We also applied existing categorical annotations from the drug ontologies to classify SPL drug labels into corresponding classes. We established the classification and relevant linkage for SPL drug labels using the following three approaches. First, we retrieved NDF-RT categorical information from the External Pharmacologic Class (EPC) indexing SPLs. Second, we used the RxNorm and NDF-RT mappings to classify and link SPLs with NDF-RT categories. Third, we profiled SPLs using RxNorm term type information. In the implementation, we employed a Semantic Web technology framework in which we stored the data sets from NDF-RT and the SPLs in an RDF triple store and executed SPARQL queries against customized SPARQL endpoints; in parallel, we imported the RxNorm data into a MySQL relational database. In total, 96.0% of SPL drug labels were mapped to NDF-RT categories, and 97.0% of SPL drug labels were linked to RxNorm codes. We found that the majority of SPL drug labels map to chemical ingredient concepts in both drug ontologies, whereas a relatively small portion map to clinical drug concepts. The profiling outcomes produced by this study provide useful insights into the meaningful use of FDA SPL drug labels in clinical applications through standard drug ontologies such as NDF-RT and RxNorm.
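The SPARQL-endpoint step can be sketched as follows, assuming a locally hosted endpoint and a hypothetical graph layout; the endpoint URL, prefixes, and predicate names are illustrative placeholders rather than the actual triple store used in this work.

# Illustrative sketch: retrieving pharmacologic class annotations for SPL labels
# from a SPARQL endpoint. Requires a running endpoint; all names are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # hypothetical local endpoint
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX spl:   <http://example.org/spl/>
    PREFIX ndfrt: <http://example.org/ndfrt/>
    SELECT ?label ?epcClass WHERE {
        ?label    a spl:DrugLabel ;
                  spl:hasEPC ?epcClass .
        ?epcClass a ndfrt:PharmacologicClass .
    }
    LIMIT 10
""")

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"], "->", binding["epcClass"]["value"])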
e13552 Background: Real-world data from Electronic Health Records (EHR) have been widely used for patient identification to build study cohorts for clinical research. Traditionally, diagnosis codes in the EHR, such as International Classification of Diseases (ICD) codes, are used to identify the target patients. However, the accuracy of this approach depends on the accuracy of ICD coding, with potential errors especially for tumor types that are frequent locations for metastases (which may contribute to mis-coding). In this study, we aimed to develop a Machine Learning (ML) based approach using EHR data to improve the accuracy of identifying patients with lung cancer. Methods: We used survey respondents in the Enhanced, EHR-facilitated Cancer Symptom Control (E2C2, NCT03892967) cluster-randomized trial at Mayo Clinic as our initial pan-cancer cohort. E2C2 includes adults receiving Medical Hematology/Oncology care for a solid or liquid tumor at Mayo Clinic. We collected cancer diagnoses from the individually abstracted Mayo Clinic Cancer Registry to annotate the cancer type for each patient. Lung cancer related ICD-9 (162.X) and ICD-10 (C34.X) codes were used to build a search query on the Mayo Clinic EHR to find target patients in the E2C2 cohort and to assess the performance of ICD-based lung cancer patient identification. Diagnosis, radiation oncology treatment (CPT 77261 - 77799), and antineoplastic drug administration data were collected from the EHR as variables. Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGB) models were built for lung cancer patient identification. 10-fold cross-validation was used to assess the models. Precision, Recall, F1 Score, and area under the curve (AUC) were selected to measure performance. Results: We collected 13,893 patients with a specific cancer diagnosis, of whom 1,394 were identified as having lung cancer. The identification performance across the different methods is shown. The ICD-based method had a precision of only 0.65, meaning it collected many false-positive cases (patients with other cancers but not lung cancer), as noted in the Background. SVM achieved the best precision, but its recall and F1 score were lower. XGB showed the best F1 Score and AUC, indicating the best and most balanced performance. Conclusions: In this study, we found that the XGB-based method achieved the best identification performance for lung cancer. In future work, we will investigate whether this also holds for the identification of other cancer types. [Table: see text]
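A minimal sketch of the model comparison, assuming scikit-learn and the xgboost package, with a synthetic, class-imbalanced feature matrix standing in for the EHR-derived diagnosis, radiation treatment, and drug administration variables:

# Minimal sketch: 10-fold cross-validated comparison of LR, SVM, RF, and XGB
# with precision, recall, F1, and AUC. The data here are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced cohort: ~10% positive (lung cancer) cases.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
    "XGB": XGBClassifier(eval_metric="logloss"),
}
scoring = ["precision", "recall", "f1", "roc_auc"]

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10, scoring=scoring)
    summary = {m: round(np.mean(scores[f"test_{m}"]), 3) for m in scoring}
    print(name, summary)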
Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and the potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict cancers of unknown primaries. We extracted genetic data elements from oncology genetic reports of 1011 patients with cancer and their corresponding phenotypic data from Mayo Clinic's electronic health records. We modeled both genetic and electronic health record data with HL7 Fast Healthcare Interoperability Resources. The Semantic Web Resource Description Framework was employed to generate the network-based data representation (ie, a patient-phenotypic-genetic network). Based on the Resource Description Framework data graph, the Node2vec graph-embedding algorithm was applied to generate features. Multiple machine learning and deep learning backbone models were compared for cancer prediction performance. With 6 machine learning tasks designed in the experiment, we demonstrated that the proposed method achieved favorable results in classifying primary cancer types (area under the receiver operating characteristic curve [AUROC] 96.56% for all 9 cancer predictions on average, based on cross-validation) and predicting unknown primaries (AUROC 80.77% for all 8 cancer predictions on average, for real-patient validation). To demonstrate interpretability, the 17 phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review. Accurate prediction of cancer types can be achieved with existing electronic health record data with satisfactory precision. The integration of genetic reports improves prediction, illustrating the translational value of incorporating genetic tests early, at the diagnosis stage, for patients with cancer.
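The graph-embedding step can be sketched as below, using the open-source node2vec package over a toy networkx graph; the nodes, edges, and cancer labels are hypothetical, and this package is only one common Node2vec implementation, not necessarily the one used in the study.

# Illustrative sketch: patient-phenotypic-genetic graph -> Node2vec embeddings ->
# downstream classifier. All graph content and labels are hypothetical.
import networkx as nx
from node2vec import Node2Vec
from sklearn.linear_model import LogisticRegression

# Toy RDF-like graph: patients linked to phenotype and gene-variant nodes.
G = nx.Graph()
G.add_edges_from([
    ("patient_1", "phenotype_cough"), ("patient_1", "variant_EGFR_L858R"),
    ("patient_2", "phenotype_jaundice"), ("patient_2", "variant_KRAS_G12D"),
])

# Learn low-dimensional node embeddings from biased random walks.
node2vec = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50)
model = node2vec.fit(window=5, min_count=1)

# Use the patient embeddings as features for a cancer-type classifier.
X = [model.wv[p] for p in ["patient_1", "patient_2"]]
y = ["lung", "pancreatic"]  # hypothetical labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))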
BACKGROUND
Knowledge graphs of multiple types of biomedical associations, including COVID-19–related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions, as co-occurrence in the literature does not always mean there is a true biomedical association between two entities.

OBJECTIVE
Data quality plays an important role in training deep neural network models; however, most of the current work in this area has focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information.

METHODS
The proposed framework used generative deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and PubTator.

RESULTS
The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited labeled data available.

CONCLUSIONS
Our preliminary findings show that the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.
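As a minimal sketch of the evaluation idea only, the code below scores candidate edges of a graph and measures AUROC against held-out true and spurious pairs; a simple Jaccard-coefficient scorer stands in for the NetGAN and CELL generative models, and the toy graph and labels are illustrative.

# Minimal sketch: edge classification (link prediction) evaluated with AUROC.
# A Jaccard-coefficient scorer stands in for the generative models in the study.
import networkx as nx
from sklearn.metrics import roc_auc_score

# Toy graph standing in for a noisy co-occurrence knowledge graph.
G = nx.karate_club_graph()

# Candidate pairs: some true associations (label 1), some spurious (label 0).
candidates = [(0, 1), (0, 2), (32, 33), (0, 33), (5, 25), (13, 30)]
labels =     [1,      1,      1,        0,       0,       0]

# Score each candidate pair; a higher score means a more plausible association.
scores = [score for _, _, score in nx.jaccard_coefficient(G, candidates)]

print("AUROC:", roc_auc_score(labels, scores))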