In this paper, we propose privacy-enhancing technologies for medical tests and personalized medicine methods that use patients' genomic data. Focusing on genetic disease-susceptibility tests, we develop a new architecture (between the patient and the medical unit) and propose a "privacy-preserving disease susceptibility test" (PDS) that uses homomorphic encryption and proxy re-encryption. Assuming that whole-genome sequencing is done by a certified institution, we propose to store patients' genomic data, encrypted under their public keys, at a "storage and processing unit" (SPU). Our proposed solution lets the medical unit retrieve the encrypted genomic data from the SPU and process it for medical tests and personalized medicine methods, while preserving the privacy of patients' genomic data. We also quantify the genomic privacy of a patient (from the medical unit's point of view) and show how a patient's genomic privacy decreases with the genetic tests the patient undergoes due to (i) the nature of the genetic test, and (ii) the characteristics of the genomic data. Furthermore, we show how basic policies and obfuscation methods help to keep the genomic privacy of a patient at a high level. We also implement PDS and show its practicality via a complexity analysis.
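As a rough illustration of computing on encrypted genomic data, the sketch below uses textbook Paillier encryption, one well-known additively homomorphic scheme, to evaluate a weighted sum over encrypted SNP values without decrypting them. The abstract does not specify the scheme, the SNP encoding or the test function, so the toy primes, the 0/1/2 encoding, the integer weights and all function names are assumptions. Proxy re-encryption is omitted, and the parameters are far too small to be secure.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair (assumed tiny primes, insecure by design)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def encrypted_weighted_sum(n, ciphertexts, weights):
    """Compute Enc(sum(w_i * m_i)) from Enc(m_i) without decrypting."""
    n2 = n * n
    acc = 1  # 1 is a valid encryption of 0
    for c, w in zip(ciphertexts, weights):
        acc = acc * pow(c, w, n2) % n2  # Enc(m)^w = Enc(w * m)
    return acc

# Hypothetical example: three SNPs encoded as 0/1/2 with integer weights.
rng = random.Random(0)
n, priv = keygen()
snps = [2, 0, 1]
weights = [3, 5, 2]
encrypted_snps = [encrypt(n, s, rng) for s in snps]
score_ct = encrypted_weighted_sum(n, encrypted_snps, weights)
score = decrypt(n, priv, score_ct)  # only the key holder can do this
```

In this sketch the party holding only ciphertexts performs the multiplication and exponentiation steps, and only the holder of the private key learns the final score.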
Purpose: Swiss BioRef is a nationwide multicentre infrastructure project that aims to become a sustainable framework for the estimation and assessment of patient-group-specific reference intervals in laboratory medicine and beyond. In this unprecedented effort, nationwide multidimensional data from multiple clinical laboratory databases have been combined under the common interoperable semantic framework of the Swiss Personalized Health Network (SPHN) initiative. The consolidated effort enables the creation of highly detailed patient-group-specific queries via intuitive web applications, allowing individualised, covariate-adjusted reference intervals to be generated on the fly.
Participants: The project is a collaborative effort of four major hospitals in Switzerland, the University Hospital Bern (Inselspital, “Insel”), University Hospital Lausanne (CHUV), Swiss Spinal Cord Injury Cohort (“SwiSCI”) and the University Children’s Hospital Zurich (“KiSpi”), and two academic groups in Bern and in Lausanne.
Findings to date: Within the infrastructure we deployed, the laboratory data from four major hospitals (approximately 9 million measurements from 250’000 patients) are made available to two conceptually different web applications (one centralised and statistically detailed, one decentralised using distributed computing). They enable the inference of reference intervals for more than 40 blood test variables from clinical chemistry, haematology, point-of-care testing and coagulation testing. Various patient factors (such as age, sex and a combination of ICD-10-defined diagnoses) and analytical factors (such as type or unique identifiers) can be used to generate precise reference intervals for the respective groups.
Future plan: Now that all required basic infrastructure elements for Swiss BioRef are deployed, we are evaluating inter-cohort transferability of semantic standards, “change tracking” in merged databases and biological variation of the blood test variables, in order to generate precise reference intervals. While adjusting the developed web interfaces to suit the needs of the various end-users, we additionally plan to onboard new national and international partners.
Strengths and limitations of this study: The Swiss BioRef project is the first multi-cohort infrastructure in Switzerland for the estimation of precise reference intervals in laboratory medicine. With the BioRef consortium agreement, a common framework for multi-cohort data sharing, hosting and accessing has been thoroughly defined. The definition of interoperable data formats and data encoding for Swiss BioRef permits the fusion of the various data sources into a unified infrastructure. Due to differing data management systems at the individual clinical data warehouses, the harmonisation of data contributions requires significant effort, which limits direct data provision. Two different web applications with varying data access architectures enable researchers to map the individual complexity of their patients into a substantiated statistical analysis to infer precise and highly relevant reference intervals. Anticipating the requirements of an increasingly diverse user base remains a challenging task. Due to the modular, expandable architecture of Swiss BioRef, potential national and international partners can easily access and even join the network.
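To make the idea of a patient-group-specific reference interval concrete, the sketch below filters laboratory records by patient factors and returns the central 95% of the remaining values (2.5th to 97.5th percentile), which is the conventional nonparametric definition of a reference interval. The abstract does not describe the statistical method or data model actually used, so the record fields (`analyte`, `sex`, `age`, `value`) and the function name are assumptions for illustration only.

```python
import statistics

def reference_interval(records, analyte, sex=None, age_range=None):
    """Central 95% interval of an analyte for an optional patient subgroup.

    records: iterable of dicts with hypothetical keys
             'analyte', 'sex', 'age' and 'value'.
    """
    values = [
        r["value"]
        for r in records
        if r["analyte"] == analyte
        and (sex is None or r["sex"] == sex)
        and (age_range is None or age_range[0] <= r["age"] <= age_range[1])
    ]
    if len(values) < 2:
        raise ValueError("not enough measurements in the selected group")
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical synthetic records (values are made up, units omitted).
example_records = [
    {"analyte": "haemoglobin", "sex": "F", "age": 30 + i % 20, "value": float(v)}
    for i, v in enumerate(range(101))
] + [
    {"analyte": "haemoglobin", "sex": "M", "age": 45, "value": 500.0},
]
low, high = reference_interval(example_records, "haemoglobin", sex="F")
```

A real deployment would need far larger groups and outlier handling before such percentiles are clinically meaningful.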
Current solutions for privacy-preserving data sharing among multiple parties either depend on a centralized authority that must be trusted and provides only weakest-link security (e.g., the entity that manages private/secret cryptographic keys), or rely on decentralized but impractical approaches (e.g., secure multi-party computation). When the data to be shared are of a sensitive nature and the number of data providers is high, these solutions are not appropriate. Therefore, we present UnLynx, a new decentralized system for efficient privacy-preserving data sharing. We consider m servers that constitute a collective authority whose goal is to verifiably compute on data sent from n data providers. UnLynx guarantees the confidentiality, unlinkability between data providers and their data, privacy of the end result and the correctness of computations by the servers. Furthermore, to support differentially private queries, UnLynx can collectively add noise under encryption. All of this is achieved through a combination of new distributed and secure protocols that are based on homomorphic cryptography, verifiable shuffling and zero-knowledge proofs. UnLynx is highly parallelizable and modular by design, as it enables multiple security/privacy vs. runtime tradeoffs. Our evaluation shows that UnLynx can execute a secure survey on 400,000 personal data records containing 5 encrypted attributes, distributed over 20 independent databases, for a total of 2,000,000 ciphertexts, in 24 minutes.
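The notion of a collective authority computing on encrypted data can be illustrated with a small sketch. It uses exponential ElGamal over a toy prime-order group, chosen here only because it is a simple additively homomorphic scheme; the abstract does not name the scheme, so this choice, the group parameters and every function name are assumptions. Verifiable shuffling, zero-knowledge proofs and noise addition are left out, and the group is far too small to be secure.

```python
import random

P = 1019  # toy safe prime, P = 2 * Q + 1 (assumed parameters)
Q = 509   # prime order of the subgroup generated by G
G = 4     # generator of the order-Q subgroup

def server_keygen(rng):
    """Each server draws its own secret key and publishes G^x."""
    x = rng.randrange(1, Q)
    return x, pow(G, x, P)

def collective_key(public_keys):
    """Collective public key: product of all server public keys."""
    h = 1
    for pk in public_keys:
        h = h * pk % P
    return h

def encrypt(h, m, rng):
    """Encrypt a small integer m under the collective key h."""
    r = rng.randrange(1, Q)
    return pow(G, r, P), pow(G, m, P) * pow(h, r, P) % P

def add(c, d):
    """Homomorphic addition of the two underlying plaintexts."""
    return c[0] * d[0] % P, c[1] * d[1] % P

def partial_decrypt(x, c):
    """One server's decryption share; no single server can decrypt alone."""
    return pow(c[0], x, P)

def combine(c, shares):
    """Combine all shares and recover m by brute-force discrete log."""
    mask = 1
    for s in shares:
        mask = mask * s % P
    gm = c[1] * pow(mask, -1, P) % P
    for m in range(Q):
        if pow(G, m, P) == gm:
            return m
    raise ValueError("plaintext out of range")

# Hypothetical run: 3 servers, 3 data providers sending small counts.
rng = random.Random(1)
servers = [server_keygen(rng) for _ in range(3)]
h = collective_key([pk for _, pk in servers])
provider_values = [3, 10, 7]
total_ct = (1, 1)  # encryption of 0 with randomness 0
for v in provider_values:
    total_ct = add(total_ct, encrypt(h, v, rng))
shares = [partial_decrypt(x, total_ct) for x, _ in servers]
total = combine(total_ct, shares)
```

Because the secret key is split across the servers, the aggregate is revealed only when every server contributes its share, which avoids the weakest-link property of a single key-holding authority.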
Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. Centralizing those data for a study is often infeasible due to privacy and security concerns. Federated analytics is rapidly emerging as a solution for enabling joint analyses of distributed medical data across a group of institutions, without sharing patient-level data. However, existing approaches either provide only limited protection of patients’ privacy by requiring the institutions to share intermediate results, which can in turn leak sensitive patient-level information, or they sacrifice the accuracy of results by adding noise to the data to mitigate potential leakage. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using our system, we accurately and efficiently reproduce two published centralized studies in a federated setting, enabling biomedical insights that are not possible from individual institutions alone. Our work represents a key step towards overcoming the privacy hurdle in enabling multi-centric scientific collaborations.
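The federated Kaplan-Meier analysis mentioned above can be sketched in plaintext: each institution reduces its patients to per-time-point counts, the counts are summed across institutions, and the survival curve is computed from the totals. In a system such as the one described, the summation would take place under encryption; that layer is deliberately omitted here, and the function names and data layout are assumptions rather than the system's actual interface.

```python
from collections import Counter

def site_counts(durations, events):
    """Local aggregation at one institution.

    durations: follow-up time per patient
    events:    1 if the event was observed, 0 if censored
    Returns (deaths, exits): events per time and all exits per time.
    """
    deaths, exits = Counter(), Counter()
    for t, e in zip(durations, events):
        exits[t] += 1
        if e:
            deaths[t] += 1
    return deaths, exits

def kaplan_meier(sites):
    """Kaplan-Meier curve from the summed per-site counts."""
    deaths, exits = Counter(), Counter()
    for d, x in sites:
        deaths.update(d)  # Counter.update adds counts
        exits.update(x)
    at_risk = sum(exits.values())
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # events and censorings leave the risk set
    return curve

# Hypothetical data from two institutions.
site_a = site_counts(durations=[1, 2, 3], events=[1, 0, 1])
site_b = site_counts(durations=[2, 4], events=[1, 1])
curve = kaplan_meier([site_a, site_b])
```

Only the aggregated counts cross institutional boundaries in this sketch, and the resulting curve is identical to the one obtained by pooling all patients centrally.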
The growing availability of genomic data is revolutionizing both genomic research and medical practice by enabling what is referred to as precision medicine. Yet, one major obstacle to the development of precision medicine to its full potential is the privacy concerns related to genomic-data sharing. Unfortunately, no solution that protects genomic privacy and, at the same time, preserves data utility has been developed so far. In this paper, we introduce GENOSHARE, a comprehensive tool for systematically reasoning about inference attacks on genomic data and for enabling privacy-aware sharing of such data, i.e., sharing that respects individuals’ privacy preferences. GENOSHARE considers the risk of disclosure of sensitive attributes via attacks based on genotype, membership, and kinship inference; by using novel genomic risk-oriented metrics, it raises an alarm when an individual’s privacy policy is at risk of being violated. GENOSHARE also protects individuals’ data from inferences that could be made from the denial of data requests by using avatar genomes instead of the real genomes. We demonstrate that different inference attacks can benefit from each other, and we show the effectiveness of GENOSHARE at detecting possible inferences related to genotypes on real data from the 1000 Genomes Project.
Clinical notes contain valuable information for research and for monitoring quality of care. Named Entity Recognition (NER) is the process of identifying relevant pieces of information such as diagnoses, treatments, side effects, etc., and bringing them into a more structured form. Although recent advancements in deep learning have facilitated automated recognition, particularly in English, NER can still be challenging due to limited specialized training data. This is exacerbated in hospital settings, where annotations are costly to obtain without appropriate incentives and often depend on local specificities. In this work, we study whether this annotation process can be effectively accelerated by combining two practical strategies. First, we convert usually passive annotation tasks into a proactive contest to motivate human annotators in performing a task often considered tedious and time-consuming. Second, we provide pre-annotations for the participants to evaluate how the recall and precision of the pre-annotations can boost or deteriorate annotation performance. We applied both strategies to a text de-identification task on French clinical notes and discharge summaries at a large Swiss university hospital. Our results show that a proactive contest and average-quality pre-annotations can significantly reduce annotation time and increase annotation quality, enabling us to develop a text de-identification model for French clinical notes with high performance (F1 score 0.94).
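For readers less familiar with the metrics named above, the sketch below computes precision, recall and F1 for a de-identification task by exact matching of annotated spans. The span representation `(start, end, label)`, the exact-match criterion and the example labels are assumptions; the abstract does not state how matches were counted in the study.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over annotated spans.

    gold, predicted: iterables of hashable spans, here assumed to be
    (start_offset, end_offset, label) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical spans for one note: three of four entities are found,
# plus one spurious prediction.
gold_spans = [(0, 11, "NAME"), (25, 35, "DATE"), (40, 52, "HOSPITAL"), (60, 70, "PHONE")]
predicted_spans = [(0, 11, "NAME"), (25, 35, "DATE"), (40, 52, "HOSPITAL"), (80, 85, "DATE")]
precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
```

Low-recall pre-annotations leave more entities for the annotator to find, while low-precision ones add spurious spans to remove, which is why the two quantities are worth reporting separately.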