Automatic Classification Between COVID-19 and Non-COVID-19 Pneumonia Using Symptoms, Comorbidities, and Laboratory Findings: The Khorshid COVID Cohort Study

2021 
Coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was a disaster in 2020. Accurate and early diagnosis of COVID-19 is still essential for health policymaking. The reverse transcriptase-polymerase chain reaction (RT-PCR) has been used as the operational gold standard for COVID-19 diagnosis. We aimed to design and implement a reliable COVID-19 diagnosis method that provides the risk of infection using demographics, symptoms and signs, blood markers, and the family history of diseases, and that has excellent agreement with the results obtained by RT-PCR and chest CT-scan. Our study primarily used sample data from a one-year hospital-based prospective COVID-19 open cohort, the Khorshid COVID Cohort (KCC) study. A sample of 634 COVID-19 patients and 118 pneumonia patients with similar characteristics whose RT-PCR and chest CT-scan were negative (the control group) (dataset 1) was used for system design and internal validation. Two other online datasets, including some symptoms (dataset 2) and blood tests (dataset 3), were also analyzed. A combination of one-hot encoding, stability feature selection, over-sampling, and an ensemble classifier was used. Three-fold stratified cross-validation was used for internal validation. In addition to gender and symptom duration, signs and symptoms, blood biomarkers, and comorbidities were selected as features. The performance indices of the cross-validated confusion matrix for dataset 1 were as follows: sensitivity of 96% [95% CI: 94-98], specificity of 95% [90-99], positive predictive value (PPV) of 99% [98-100], negative predictive value (NPV) of 82% [76-89], diagnostic odds ratio (DOR) of 496 [198-1245], area under the ROC curve (AUC) of 0.96 [0.94-0.97], Matthews correlation coefficient (MCC) of 0.87 [0.85-0.88], accuracy of 96% [94-98], and Cohen's kappa of 0.86 [0.81-0.91]. The proposed algorithm showed excellent diagnosis accuracy and class-labeling agreement, and fair discriminant power. The AUC on datasets 2 and 3 was 0.97 [0.96-0.98] and 0.92 [0.91-0.94], respectively.
The most important features were white blood cell count, shortness of breath, and C-reactive protein for datasets 1, 2, and 3, respectively. The proposed algorithm is thus a promising COVID-19 diagnosis method, which could complement simple blood tests and symptom screening. However, the RT-PCR and chest CT-scan used as the gold standard are not 100% accurate.
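The performance indices listed above are all functions of the four cells of a binary confusion matrix. As a minimal sketch of the standard definitions (not the authors' code), the following Python function computes them from raw counts. The function name `diagnostic_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the study's confusion matrix. AUC and the confidence intervals need predicted scores and a resampling or analytic procedure, so they are not covered here.

```python
import math


def diagnostic_metrics(tp, fn, fp, tn):
    """Compute common diagnostic indices from confusion-matrix counts.

    tp/fn: diseased cases classified as positive/negative.
    fp/tn: control cases classified as positive/negative.
    Assumes every cell is non-zero (otherwise DOR is undefined).
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / n
    # Diagnostic odds ratio: odds of a positive test in cases vs. controls.
    dor = (tp * tn) / (fp * fn)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "dor": dor,
        "mcc": mcc,
        "kappa": kappa,
    }


# Illustrative counts only (hypothetical, not taken from the study).
metrics = diagnostic_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a class imbalance like the one in dataset 1 (634 cases versus 118 controls), NPV, MCC, and kappa are more sensitive to the minority class than accuracy, which is one reason to report them alongside sensitivity and specificity.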