Mitigating Biases in CORD-19 for Analyzing COVID-19 Literature

2020 
At the behest of the Office of Science and Technology Policy in the White House, six institutions, including ours, have created an open research dataset called CORD-19 to facilitate the development of question-answering systems that can assist researchers in finding relevant research on COVID-19. As of May 27, 2020, CORD-19 includes more than 100,000 open-access publications from major publishers and PubMed, as well as preprint articles deposited in medRxiv and bioRxiv. Because CORD-19 is a small sample of the vast relevant literature, it inevitably contains sampling biases. To mitigate these biases, the statistical measures used in this study are smoothed by augmenting CORD-19 with its citation network. In total, three expanded sets are created for the analyses: (1) the enclosure set CORD-19E, composed of CORD-19 articles together with their references and citations, mirroring the methodology used in the renowned “A Century of Physics” analysis; (2) the full closure graph CORD-19C, which recursively includes references starting from CORD-19; and (3) the inflection closure CORD-19I, a much smaller subset of CORD-19C that is nevertheless already appropriate for statistical analysis, based on the theory of the scale-free nature of the citation network. All three expanded datasets show much smoother trends when used to analyze global COVID-19 research. The results suggest that, while CORD-19 exhibits a strong tilt towards recent and highly focused articles, the knowledge being explored to attack the pandemic spans a much longer time period and is highly interdisciplinary. A question-answering system with such extended knowledge may perform better in understanding the literature and answering related questions. By contrast, collaboration patterns, especially in terms of team sizes and geographical distributions, are more resilient to sampling biases and are already captured well by CORD-19: its raw statistics and trends agree with those from the larger datasets.
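The difference between the enclosure set and the full closure can be illustrated with a minimal sketch. It assumes the citation network is available as a plain mapping from each paper to the papers it references; the function names, the toy identifiers, and the data layout are all hypothetical and are not taken from the study. The inflection closure CORD-19I is left out because the abstract does not specify how the inflection point is determined.

```python
from collections import deque


def invert(references):
    """Build the 'cited by' mapping from a 'references' mapping."""
    citations = {}
    for paper, refs in references.items():
        for ref in refs:
            citations.setdefault(ref, set()).add(paper)
    return citations


def enclosure(seed, references, citations):
    """Seed papers plus their direct references and direct citations (one hop)."""
    result = set(seed)
    for paper in seed:
        result.update(references.get(paper, ()))
        result.update(citations.get(paper, ()))
    return result


def reference_closure(seed, references):
    """Seed papers plus everything reachable by following references recursively."""
    seen = set(seed)
    queue = deque(seed)
    while queue:
        paper = queue.popleft()
        for ref in references.get(paper, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# Toy network (hypothetical): D cites A, A cites B, B cites C.
references = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}}
citations = invert(references)
seed = {"A"}  # stands in for the CORD-19 articles

enclosure_set = enclosure(seed, references, citations)  # analogous to CORD-19E
closure_set = reference_closure(seed, references)       # analogous to CORD-19C
```

In this toy network the enclosure picks up the citing paper D but stops at the direct reference B, whereas the closure follows references all the way back to C and ignores citing papers. That matches the abstract's description of CORD-19E as a one-hop neighbourhood and CORD-19C as a recursive expansion over references.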