Evaluation of a rule-based method for epidemiological document classification towards the automation of systematic reviews

George Karystianis, Kristina A. Thayer, Mary S. Wolfe, Guy Tsafnat

2017

Implemented a generic rule-based method to extract epidemiological characteristics. Created environmental health exposure specific dictionaries. Precision ranging from 81% to 100% across six characteristics. Automated extraction is a feasible approach towards automation of systematic reviews.

Introduction

Most data extraction efforts in epidemiology are focused on obtaining targeted information from clinical trials. In contrast, limited research has been conducted on the identification of information from observational studies, a major source for human evidence in many fields, including environmental health. The recognition of key epidemiological information (e.g., exposures) through text mining techniques can assist in the automation of systematic reviews and other evidence summaries.

Method

We designed and applied a knowledge-driven, rule-based approach to identify targeted information (study design, participant population, exposure, outcome, confounding factors, and the country where the study was conducted) from abstracts of epidemiological studies included in several systematic reviews of environmental health exposures. The rules were based on common syntactical patterns observed in text and are thus not specific to any systematic review. To validate the general applicability of our approach, we compared the data extracted using our approach versus hand curation for 35 epidemiological study abstracts manually selected for inclusion in two systematic reviews.

Results

The returned F-score, precision, and recall ranged from 70% to 98%, 81% to 100%, and 54% to 97%, respectively. The highest precision was observed for exposure, outcome and population (100%) while recall was best for exposure and study design with 97% and 89%, respectively. The lowest recall was observed for the population (54%), which also had the lowest F-score (70%).
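
As a quick consistency check on the reported figures, the sketch below recomputes the F-score for the population characteristic from its precision (100%) and recall (54%). It assumes the balanced F-score (harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is the balanced (F1) variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Population: precision 100%, recall 54% (values reported in the abstract).
population_f1 = f1_score(1.00, 0.54)
print(f"{population_f1:.0%}")  # prints "70%"
```

Under that assumption the result matches the reported 70%, which also shows how a low recall pulls the F-score down even when precision is perfect.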
Conclusion

Our text-mining approach showed encouraging performance for the identification of targeted information from observational epidemiological study abstracts related to environmental exposures. We have demonstrated that rules based on generic syntactic patterns in one corpus can be applied to other observational study designs by simply interchanging the dictionaries that identify certain characteristics (i.e., outcomes, exposures). At the document level, the recognised information can assist in the selection and categorization of studies included in a systematic review.
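
The idea of keeping the syntactic rule fixed while interchanging dictionaries can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern, the term lists, and the names `build_rule` and `extract` are all hypothetical, chosen only to show one generic rule reused with different dictionaries.

```python
import re

# Hypothetical, illustrative dictionaries; not the ones built by the authors.
EXPOSURE_TERMS = ["bisphenol A", "air pollution", "lead"]
OUTCOME_TERMS = ["obesity", "asthma", "diabetes"]


def build_rule(exposures, outcomes):
    """Compile one generic pattern ("association between X and Y")
    from interchangeable exposure and outcome dictionaries."""
    # Longest terms first so multi-word terms win over their substrings.
    exp = "|".join(re.escape(t) for t in sorted(exposures, key=len, reverse=True))
    out = "|".join(re.escape(t) for t in sorted(outcomes, key=len, reverse=True))
    return re.compile(
        rf"associations?\s+(?:between|of)\s+\b(?P<exposure>{exp})\b"
        rf"\s+(?:and|with)\s+\b(?P<outcome>{out})\b",
        re.IGNORECASE,
    )


def extract(text, rule):
    """Return (exposure, outcome) pairs matched by the rule."""
    return [(m.group("exposure"), m.group("outcome")) for m in rule.finditer(text)]


rule = build_rule(EXPOSURE_TERMS, OUTCOME_TERMS)
print(extract("We examined the association between bisphenol A and obesity.", rule))

# Same rule, different dictionaries: only the term lists change.
noise_rule = build_rule(["noise"], ["hypertension"])
print(extract("The association of noise with hypertension was assessed.", noise_rule))
```

A real system would need many such patterns and far larger dictionaries; the point here is only that the rule itself carries no review-specific vocabulary.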
