Autosomal recessive Gaucher disease (GD) is likely underdiagnosed in many countries. Because the number of diagnosed GD patients in Finland is relatively low and the true prevalence is not known, it was hypothesized that undiagnosed GD patients may exist in Finland. Our previous study demonstrated the applicability of the Gaucher Earlier Diagnosis Consensus point-scoring system (GED-C PSS; Mehta et al., 2019) and of Finnish biobank data and specimens to the automated point scoring of large populations. An indicative point-score range for Finnish GD patients was determined, but undiagnosed patients were not identified, partly because of the high number of high-score subjects combined with a lack of suitable samples for diagnostics in the assessed biobank population. The current study extended the screening to another biobank and evaluated the feasibility of using the automated GED-C PSS in conjunction with single nucleotide polymorphism (SNP) chip genotype data from the FinnGen study of biobank sample donors to identify undiagnosed GD patients in Finland. Furthermore, the applicability of formalin-fixed paraffin-embedded (FFPE) tissues and DNA restoration to next-generation sequencing (NGS) of the GBA gene was tested. Previously diagnosed Finnish GD patients eligible for the study and up to 45,100 sample donors in Helsinki Biobank (HBB) were point scored. The GED-C point scoring, adjusted to local data, was automated and, for GD patients, partly verified manually. The SNP chip genotype data for rare GBA variants was visually assessed. FFPE tissues of GD patients were obtained from HBB and Biobank Borealis of Northern Finland (BB). Three previously diagnosed GD patients and one patient previously treated for GD-related features were included. A genetic diagnosis was confirmed for the patient treated for GD-related features. The GED-C point score of the GD patients was 12.5–22.5 in the current study.
The scores of the eight Finnish GD patients in the previous and current studies thus range from 6 to 22.5 points per patient. In the automated point scoring of the HBB subpopulation (N ≈ 45,100), the overall scores ranged from 0 to 17.5, with 0.77% (346/45,100) of the subjects having ≥10 points. The analysis of SNP chip genotype data identified the diagnosed GD patients, but no potential undiagnosed patients with a GED-C score and/or a GBA genotype indicative of GD were discovered. Restoration of the FFPE tissue DNA improved the quality of the GBA NGS, and pathogenic GBA variants were confirmed in five out of six unrestored and in all four restored FFPE DNA samples. These findings imply that the prevalence of diagnosed patients (~1:325,000) may indeed correspond to the true prevalence of GD in Finland. SNP chip genotype data is a valuable tool that complements screening with the GED-C PSS, especially if the genotyping pipeline is tuned for rare variants. These proof-of-concept biobank tools can be adapted to other rare genetic diseases.
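The combination of point scoring and genotype data described above can be illustrated with a minimal sketch. It is not the study's pipeline: the `Donor` record, the `gba_risk_alleles` field, and the `flag_for_review` helper are hypothetical names, and the 10-point cut-off is simply the threshold reported in the abstract. Counting risk alleles is a simplification, since two heterozygous variants would need phasing before they could be called biallelic.

```python
from dataclasses import dataclass
from typing import List

# Assumption: reuse the >=10-point cut-off reported in the abstract.
SCORE_THRESHOLD = 10.0


@dataclass
class Donor:
    donor_id: str
    gedc_score: float       # automated GED-C point score
    gba_risk_alleles: int   # hypothetical: rare pathogenic GBA alleles called on the chip (0, 1 or 2)


def flag_for_review(donor: Donor, threshold: float = SCORE_THRESHOLD) -> List[str]:
    """Return the reasons (possibly none) why a donor might merit manual review."""
    reasons = []
    if donor.gedc_score >= threshold:
        reasons.append("high GED-C score")
    # GD is autosomal recessive, so only a biallelic genotype is treated as indicative.
    if donor.gba_risk_alleles >= 2:
        reasons.append("biallelic GBA genotype")
    return reasons


def high_score_fraction(n_high: int, n_total: int) -> float:
    """Share of scored subjects at or above the threshold."""
    return n_high / n_total


if __name__ == "__main__":
    donors = [
        Donor("A", 12.5, 2),   # illustrative values only
        Donor("B", 11.0, 0),
        Donor("C", 3.0, 1),
    ]
    for d in donors:
        print(d.donor_id, flag_for_review(d))
    # Reproduces the proportion reported in the abstract.
    print(f"{high_score_fraction(346, 45_100):.2%}")  # prints 0.77%
```

A donor flagged on both criteria would be the strongest candidate for follow-up. In the study, no such undiagnosed donor was found.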
Abstract. Objective: To assess whether text mining of electronic health record (EHR) data can be used to improve register-based heart failure (HF) subtyping. EHR data of 43,405 individuals from two Finnish hospital biobanks were mined for unstructured text mentions of ejection fraction (EF) and validated against clinical assessment in two sets of 100 randomly selected individuals. Structured laboratory data were then incorporated to categorize individuals by HF subtype (HF with mildly reduced EF, HFmrEF; HF with preserved EF, HFpEF; HF with reduced EF, HFrEF; and no HF). Results: In 86% of the cases, the algorithm-identified EF belonged to the correct HF subtype range. Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the algorithm were 94–100% for HFrEF and 85–100% for HFmrEF, and 96%, 67%, 53% and 98%, respectively, for HFpEF. Survival analyses using the traditional diagnosis of HF were in concordance with the algorithm-based ones. Compared to healthy individuals, mortality increased from the HFmrEF group (hazard ratio [HR], 1.91; 95% confidence interval [CI], 1.24–2.95) to the HFpEF group (2.28; 1.80–2.88) to the HFrEF group (2.63; 1.97–3.50) over a follow-up of 1.5 years. We conclude that quantitative EF data can be efficiently extracted from EHRs and used with laboratory data to subtype HF with reasonable accuracy, especially for HFrEF.
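The idea of mining free text for EF mentions and mapping them to subtype ranges can be sketched as follows. This is an illustration, not the study's algorithm. The regular expression, the function names, and the sample notes are assumptions, and real Finnish clinical text would need its own patterns. The EF cut-offs (≤40%, 41–49%, ≥50%) follow the commonly used guideline ranges and are not stated in the abstract. The laboratory-data step used in the study is omitted, so an EF in the preserved range here does not by itself establish HFpEF.

```python
import re
from typing import List, Optional

# Illustrative pattern: matches e.g. "EF 35%", "LVEF=40-45 %", "ejection fraction: 55 %".
# The leading \b keeps the "EF" inside words such as "HFrEF" from matching.
EF_PATTERN = re.compile(
    r"\b(?:LV)?(?:EF|ejection\s+fraction)\b\s*[:=]?\s*"
    r"(\d{1,2}(?:[.,]\d)?)\s*(?:[-–]\s*(\d{1,2}(?:[.,]\d)?))?\s*%",
    re.IGNORECASE,
)


def _to_float(number: str) -> float:
    """Parse a number that may use a decimal comma."""
    return float(number.replace(",", "."))


def extract_ef_values(text: str) -> List[float]:
    """Return every EF percentage mentioned in the text; ranges become their midpoint."""
    values = []
    for match in EF_PATTERN.finditer(text):
        low = _to_float(match.group(1))
        high = _to_float(match.group(2)) if match.group(2) else low
        values.append((low + high) / 2)
    return values


def classify_by_ef(ef: Optional[float]) -> str:
    """Map an EF percentage to a subtype range (assumed guideline cut-offs)."""
    if ef is None:
        return "unknown"
    if ef <= 40:
        return "HFrEF"
    if ef < 50:
        return "HFmrEF"
    return "HFpEF"  # EF range only; the study also used laboratory data


if __name__ == "__main__":
    notes = [
        "Echo: EF 35%, mild MR",        # made-up example notes
        "LVEF=40-45 %",
        "Ejection fraction: 55 %",
        "HFrEF suspected, no echo",
    ]
    for note in notes:
        efs = extract_ef_values(note)
        print(note, "->", classify_by_ef(efs[-1] if efs else None))
```

Validation against clinician-assessed EF, as done in the study on two sets of 100 individuals, is what turns such extracted values into the reported sensitivity, specificity, PPV and NPV figures.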