An integrated approach based on Gaussian noises-based data augmentation method and AdaBoost model to predict faecal coliforms in rivers with small dataset

2021 
Abstract: Machine Learning (ML) techniques can be valuable for modelling faecal contamination in rivers and for overcoming the limitations of process-based models. However, this approach requires sufficiently large datasets for training and validation to avoid over-fitting. This study attempts to overcome the small-dataset limitation by relying on data augmentation techniques. To that end, Adaptive Boosting (AdaBoost) models were trained and integrated into the data augmentation method to generate 600 virtual samples based on 40 original datasets. The results revealed that the proposed method significantly improved the accuracy (RMSE = 0.716 ln(Colony Forming Unit (CFU)/100 ml)) and generalization ability of the AdaBoost model for predicting faecal coliforms in rivers, compared to the baseline model developed only with the small dataset (RMSE = 2.348 ln(CFU/100 ml)). However, the study also showed that generating and using too many virtual data could deteriorate the generalization ability of the ML model, and that the optimal number of virtual samples is about 337–415. Overall, the results of this study provide new insights for improving the prediction accuracy of the health risk related to faecal coliforms in raw water used for drinking purposes when only a small dataset is available. The developed method can broaden the application of ML to water resources and environmental sciences when it is impossible to obtain the large dataset required by ML models.
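The augmentation idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the noise parameters, the predictor variables, or how the AdaBoost model is coupled to the generator, so everything below is an assumption made for illustration: the function name `augment_with_gaussian_noise`, the `noise_scale` parameter (noise standard deviation as a fraction of each column's standard deviation), and the three synthetic columns are hypothetical. Only the sample counts (40 original, 600 virtual) and the RMSE metric come from the abstract.

```python
import math
import random


def augment_with_gaussian_noise(samples, n_virtual, noise_scale=0.05, seed=0):
    """Return n_virtual noisy copies of randomly chosen original rows.

    noise_scale is a hypothetical setting: the noise sigma for each column
    is noise_scale times that column's standard deviation.
    """
    rng = random.Random(seed)
    n_cols = len(samples[0])
    # Per-column (population) standard deviation of the original data.
    sigmas = []
    for j in range(n_cols):
        col = [row[j] for row in samples]
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        sigmas.append(math.sqrt(var))
    virtual = []
    for _ in range(n_virtual):
        base = rng.choice(samples)  # pick one original row to perturb
        virtual.append(
            [x + rng.gauss(0.0, noise_scale * s) for x, s in zip(base, sigmas)]
        )
    return virtual


def rmse(y_true, y_pred):
    """Root-mean-square error, the metric reported in the abstract."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Synthetic stand-in for the 40 original rows (not the study's data).
data_rng = random.Random(42)
original = [
    [data_rng.uniform(5, 30), data_rng.uniform(0, 100), data_rng.uniform(0, 12)]
    for _ in range(40)
]
virtual = augment_with_gaussian_noise(original, n_virtual=600)
training_set = original + virtual  # 40 original + 600 virtual rows
```

In a fuller pipeline, a boosted regressor would be fitted on `training_set` and scored with `rmse` on held-out original samples only. Repeating that for different values of `n_virtual` is one way to look for the point beyond which extra virtual samples stop helping, which the abstract places at roughly 337–415.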