Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy

2021 
Aim: In neuroscience research, data are often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even degrade the predictive performance of machine learning methods. Different resampling procedures have been developed to address this problem, and considerable work has compared their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested across a wide variety of data sets, without considering performance on each specific data set. In this study we compare the performance of different resampling procedures for the imbalanced domain on stereo-electroencephalography (SEEG) recordings of patients with focal epilepsy who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsy, framed as a supervised classification problem aimed at distinguishing epileptogenic from non-epileptogenic brain regions in the interictal condition. We investigated the effectiveness of 5 oversampling and 5 undersampling procedures, using 10 different machine learning classifiers. In addition, 6 ensemble methods designed for the imbalanced domain were tested. Performance was compared using AUC, F-measure, Geometric Mean and Balanced Accuracy.
Results: Both families of resampling procedures clearly improved performance with respect to the original data set. The oversampling procedures were more sensitive to the type of classification method employed, with ADASYN (adaptive synthetic sampling) performing best. The undersampling approaches were all more robust than the oversampling ones across the different classifiers, with RUS (random undersampling) performing best despite being the simplest and most basic resampling method.
Conclusions: Applying machine learning techniques that account for class balance through resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. Our results also highlight the importance of the choice of classification method used together with the resampling in order to maximize the benefit to the outcome.
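
To make the terms above concrete, here is a minimal sketch of random undersampling (RUS) and of three of the imbalance-aware metrics named in the Methods (Balanced Accuracy, Geometric Mean, F-measure). It is not the study's code: the function names `random_undersample` and `imbalance_metrics`, the toy 90/10 class split, and the choice to shrink every class to the size of the smallest one are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Hypothetical helper: randomly keep only as many samples of each
    class as the smallest class contains (one common form of RUS)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


def imbalance_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: binary metrics that stay informative under
    class imbalance, computed from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": f_measure,
    }


# Toy data (assumed): 90 majority-class samples (label 0), 10 minority (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

# RUS leaves both classes with 10 samples each.
X_res, y_res = random_undersample(X, y, seed=42)
class_counts = Counter(y_res)

# A degenerate classifier that always predicts the majority class reaches
# 90% plain accuracy on the original data, yet scores poorly on these metrics.
y_pred_majority = [0] * len(y)
plain_accuracy = sum(t == p for t, p in zip(y, y_pred_majority)) / len(y)
metrics = imbalance_metrics(y, y_pred_majority)
```

In this toy case the always-majority predictor has a plain accuracy of 0.9 but a Balanced Accuracy of 0.5 and a Geometric Mean and F-measure of 0, which is why such metrics are preferred over accuracy when classes are imbalanced.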