It is widely accepted and acknowledged that data harmonization is crucial: in its absence, the co-analysis of major tranches of high quality extant data is liable to inefficiency or error. However, despite its widespread practice, no formalized/systematic guidelines exist to ensure high quality retrospective data harmonization.
To better understand real-world harmonization practices and facilitate development of formal guidelines, three interrelated initiatives were undertaken between 2006 and 2015. They included a phone survey with 34 major international research initiatives, a series of workshops with experts, and case studies applying the proposed guidelines.
A wide range of projects use retrospective harmonization to support their research activities but even when appropriate approaches are used, the terminologies, procedures, technologies and methods adopted vary markedly. The generic guidelines outlined in this article delineate the essentials required and describe an interdependent step-by-step approach to harmonization: 0) define the research question, objectives and protocol; 1) assemble pre-existing knowledge and select studies; 2) define targeted variables and evaluate harmonization potential; 3) process data; 4) estimate quality of the harmonized dataset(s) generated; and 5) disseminate and preserve final harmonization products.
This manuscript provides guidelines aiming to encourage rigorous and effective approaches to harmonization which are comprehensively and transparently documented and straightforward to interpret and implement. This can be seen as a key step towards implementing guiding principles analogous to those that are well recognised as being essential in securing the foundational underpinning of systematic reviews and the meta-analysis of clinical trials.
BACKGROUND: Understanding the complex interaction of risk factors that increase the likelihood of developing common diseases is challenging. The Canadian Partnership for Tomorrow Project (CPTP) is a prospective cohort study created as a population-health research platform for assessing the effect of genetics, behaviour, family health history and environment (among other factors) on chronic diseases. METHODS: Volunteer participants were recruited from the general Canadian population for a confederation of 5 regional cohorts. Participants were enrolled in the study and core information obtained using 2 approaches: attendance at a study assessment centre for all study measures (questionnaire, venous blood sample and physical measurements) or completion of the core questionnaire (online or paper), with later collection of other study measures where possible. Physical measurements included height, weight, percentage body fat and blood pressure. Participants consented to passive follow-up through linkage with administrative health databases and active follow-up through recontact. All participant data across the 5 regional cohorts were harmonized. RESULTS: A total of 307 017 participants aged 30–74 from 8 provinces were recruited. More than half provided a venous blood sample and/or other biological sample, and 33% completed physical measurements. A total of 709 harmonized variables were created; almost 25% are available for all participants and 60% for at least 220 000 participants. INTERPRETATION: Primary recruitment for the CPTP is complete, and data and biosamples are available to Canadian and international researchers through a data-access process. The CPTP will support research into how modifiable risk factors, genetics and the environment interact to affect the development of cancer and other chronic diseases, ultimately contributing evidence to reduce the global burden of chronic disease. 
Chronic disease prevention and individualized disease management are central to public health in the 21st century [1,2]. However, the multifactorial etiology of most chronic diseases demands that we increase our understanding about how biology, genetics, environment and behaviours interact to affect disease risks and outcomes. Prospective cohort studies that track individuals over decades are important tools for exploring these complex interactions [3]. One such tool is the Canadian Partnership for Tomorrow Project (CPTP), a pan-Canadian prospective cohort that was envisioned in 2008 as a “population laboratory” [4,5] to support Canadian and international population health research in evaluating the genetic, behavioural and environmental causes of cancer and other chronic diseases. The CPTP set out with an ambitious goal to recruit 300 000 participants from 8 Canadian provinces [4], and to obtain a venous blood sample for biobanking from as many participants as possible. It is the largest prospective cohort ever created in Canada, and baseline data are now available to Canadian and international researchers. The aim of this article is to provide a baseline cohort profile of the CPTP, summarizing key sociodemographic, behavioural and health-related characteristics of the participants. We summarize the CPTP design and participant recruitment, the harmonization of the core data, the biorepository, and the procedures established to support data sharing with researchers.
Background:
The Canadian Partnership for Tomorrow Project is a multistudy platform integrating the British Columbia Generations Project, Alberta's Tomorrow Project, the Ontario Health Study, CARTaGENE (Quebec) and the Atlantic Partnership for Tomorrow's Health. This paper describes the process used to harmonize the Health and Risk Factor Questionnaire data and provides an overview of the key information required to properly use the core data set generated.
Methods:
This is a descriptive analysis of the harmonization process that was developed on the basis of the Maelstrom Research guidelines for retrospective harmonization. Core variables (DataSchema) to be generated across cohorts were defined and the potential for cohort-specific data sets to generate the DataSchema variables was assessed. Where relevant, algorithms were developed and applied to process cohort-specific data into the DataSchema format, and information to be provided to data users was documented.
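To make the idea of processing cohort-specific data into a common DataSchema format more concrete, the following is a minimal sketch in Python. It is not the project's actual implementation: the target variable `current_smoker`, the cohort names, the source field names (`smk_status`, `cig_per_day`) and the helper `harmonize` are all hypothetical and chosen only for illustration. The sketch assumes each cohort-specific data set either has a processing algorithm for the target variable or has been judged not harmonizable, in which case the harmonized value is left missing.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Algorithm = Callable[[Record], Optional[int]]


def cohort_a_smoker(rec: Record) -> Optional[int]:
    """Hypothetical cohort A stores a category: never / former / current."""
    status = rec.get("smk_status")
    if status is None:
        return None  # missing in the source stays missing after harmonization
    return 1 if status == "current" else 0


def cohort_b_smoker(rec: Record) -> Optional[int]:
    """Hypothetical cohort B stores cigarettes smoked per day."""
    per_day = rec.get("cig_per_day")
    if per_day is None:
        return None
    return 1 if per_day > 0 else 0


# One entry per cohort-specific data set for the target variable.
# None means the data set was assessed as not harmonizable for it.
ALGORITHMS: Dict[str, Optional[Algorithm]] = {
    "cohort_a": cohort_a_smoker,
    "cohort_b": cohort_b_smoker,
    "cohort_c": None,
}


def harmonize(cohort: str, records: List[Record]) -> List[Record]:
    """Apply the cohort's algorithm and emit rows in the common format."""
    algorithm = ALGORITHMS[cohort]
    harmonized = []
    for rec in records:
        value = algorithm(rec) if algorithm is not None else None
        harmonized.append(
            {"id": rec["id"], "cohort": cohort, "current_smoker": value}
        )
    return harmonized


if __name__ == "__main__":
    rows = harmonize("cohort_a", [{"id": 1, "smk_status": "current"}])
    rows += harmonize("cohort_b", [{"id": 2, "cig_per_day": 0}])
    rows += harmonize("cohort_c", [{"id": 3}])
    for row in rows:
        print(row)
```

Keeping the per-cohort algorithms in one lookup table mirrors the documentation goal described above: the same table can be shown to data users to explain how each harmonized value was derived, or why it is unavailable for a given cohort.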
Results:
The Health and Risk Factor Questionnaire DataSchema (version 2.0, October 2017) comprised 694 variables. Assessing the harmonization potential of each variable in each of the 12 cohort-specific data sets (694 × 12 = 8328 assessments) found 6799 (81.6%) to be harmonizable. A total of 307 017 participants were included in the harmonized data set. Through the cohort data portal, researchers can find information about the definitions of variables, harmonization potential, algorithms applied to generate harmonized variables and participant distributions.
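The reported percentage can be checked with a short calculation. The sketch below assumes that the denominator is every DataSchema variable assessed in every cohort-specific data set, an interpretation that reproduces the stated 81.6%; the variable names are illustrative only.

```python
# Figures taken from the Results above
n_variables = 694        # DataSchema variables
n_datasets = 12          # cohort-specific data sets
n_harmonizable = 6799    # assessments judged harmonizable

# Assumed denominator: one assessment per variable per data set
n_assessments = n_variables * n_datasets
share = n_harmonizable / n_assessments

print(n_assessments)     # 8328
print(f"{share:.1%}")    # 81.6%
```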
Interpretation:
The harmonization process enabled the creation of a unique data set including data on health and risk factors from over 307 000 Canadians. These data, in combination with complementary data sets, can be used to investigate the impact of biological, environmental and behavioural factors on cancer and chronic diseases.