Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections

2015 
Received 15 December 2014; Final revision 14 January 2015; Accepted 15 January 2015

SUMMARY

To design effective food safety programmes we need to estimate how many sporadic foodborne illnesses are caused by specific food sources based on case-control studies. Logistic regression has substantive limitations for analysing structured questionnaire data with numerous exposures and missing values. We adapted random forest to analyse data of a case-control study of Salmonella enterica serotype Enteritidis illness for source attribution. For estimation of summary population attributable fractions (PAFs) of exposures grouped into transmission routes, we devised a counterfactual estimator to predict reductions in illness associated with removing grouped exposures. For the purpose of comparison, we fitted the data using logistic regression models with stepwise forward and backward variable selection. Our results show that the forward and backward variable selection of logistic regression models were not consistent for parameter estimation, with different significant exposures identified. By contrast, the random forest model produced estimated PAFs of grouped exposures consistent in rank order with results obtained from outbreak data, with egg-related exposures having the highest estimated PAF (22·1%, 95% confidence interval 8·5–31·8). Random forest might be structurally more coherent and efficient than logistic regression models for attributing Salmonella illnesses to sources involving many causal pathways.

Key words: Causality, counterfactual, foodborne diseases, logistic regression, machine learning.

INTRODUCTION

Each year, about 9 million people in the United States become sick from known foodborne pathogens, resulting in more than 120000 estimated hospitalizations and 3000 deaths [1, 2]. To prevent foodborne illness, we need reliable estimates of the percentages of illness attributable to specific foods so that targeted food safety interventions can be designed.
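As an illustration of the counterfactual idea described in the summary, the following is a minimal sketch and not the estimator used in this study. It assumes that the PAF of a group of exposures can be approximated by the relative drop in total predicted risk when every exposure in the group is set to "unexposed". The names `counterfactual_paf`, `toy_predict`, and the exposure labels `eggs` and `poultry` are hypothetical; `toy_predict` is a hand-written stand-in for any fitted model (such as a random forest) that returns a probability of illness, and the four rows are made-up data.

```python
from typing import Callable, Dict, Sequence

Row = Dict[str, int]  # exposure name -> 0/1 indicator (hypothetical encoding)


def counterfactual_paf(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    group: Sequence[str],
) -> float:
    """Relative drop in total predicted risk when all exposures in `group` are set to 0."""
    observed = sum(predict(r) for r in rows)
    # Counterfactual rows: same subjects, grouped exposures switched off.
    removed = sum(predict({**r, **{name: 0 for name in group}}) for r in rows)
    return (observed - removed) / observed


def toy_predict(row: Row) -> float:
    """Hypothetical stand-in for a fitted model's predicted probability of illness."""
    risk = 0.1  # assumed baseline risk
    if row["eggs"]:
        risk += 0.2
    if row["poultry"]:
        risk += 0.1
    return risk


# Made-up illustrative data, one dict per study subject.
rows = [
    {"eggs": 1, "poultry": 0},
    {"eggs": 1, "poultry": 1},
    {"eggs": 0, "poultry": 1},
    {"eggs": 0, "poultry": 0},
]

paf_eggs = counterfactual_paf(toy_predict, rows, ["eggs"])
paf_all = counterfactual_paf(toy_predict, rows, ["eggs", "poultry"])
print(f"PAF (eggs): {paf_eggs:.2f}; PAF (eggs + poultry): {paf_all:.2f}")
```

Because the function takes a whole group of exposure names, it returns one summary value per transmission route rather than one value per questionnaire item, which is the grouping the summary describes.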
Finding the sources of foodborne illnesses is challenging because causal pathways for most individual illnesses are unknown. Data from case-control studies of sporadic infections are used to estimate population attributable fractions (PAFs), defined as the proportion of cases over a specified period that would be prevented if the causal exposure was removed from the population [3, 4]. Such estimates are needed by food safety regulatory and public health agencies to assess the likely effect of interventions.

Causal pathways of sporadic enteric diseases are complex, in part because the sources may or may