A systematic approach to deal with highly imbalanced data when predicting flight cancellations and delays

2020 
As on-time performance is one of the main contributors to success in commercial aviation, predicting flight delays and cancellations can significantly improve operational efficiency and thus quality of service. Since flight delays and cancellations are infrequent events, operational on-time performance data is inherently imbalanced. This is especially the case for cancellations: on average 1.6% of flights are cancelled, while about 33% of flights are delayed. In this research, flight operational data is combined with weather data to predict flight delays and cancellations at prediction horizons ranging from hours to months before the flight, using Neural Network and Random Forest machine learning algorithms. Since these algorithms tend to assume balanced class distributions, a systematic approach to the imbalance is needed in order to make accurate predictions. Hence, an imbalanced data approach is proposed that analyses model performance with indicators such as precision and F1-score across varying data imbalance ratios. The imbalance ratios are obtained with sampling techniques such as Synthetic Minority Oversampling and Random Undersampling. It is concluded that the highest precision is obtained without any sampling, while sampling is essential for the highest F1-score. Additionally, the research confirms that severely imbalanced data, like the cancellation data, yields worse performance than moderately imbalanced data, like the delay data.
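The idea of varying the imbalance ratio and scoring with precision and F1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `random_undersample` and `precision_recall_f1`, the label encoding (1 for the minority class such as "cancelled", 0 for the majority class), and the toy data are all assumptions made for illustration. Only Random Undersampling is shown; Synthetic Minority Oversampling is omitted because it needs feature vectors and a nearest-neighbour step.

```python
import random


def random_undersample(labels, minority_share, seed=0):
    """Return indices that keep every minority sample (label 1) and a random
    subset of majority samples (label 0), so that the minority class makes up
    roughly `minority_share` of the result. Hypothetical helper."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Majority count needed so that minority / (minority + majority) ~= share.
    n_keep = round(len(minority) * (1 - minority_share) / minority_share)
    n_keep = min(n_keep, len(majority))  # cannot keep more than exist
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels mimicking the 1.6% cancellation rate: 16 positives in 1000 flights.
labels = [1] * 16 + [0] * 984

# Sweep over a few target minority shares (assumed values, for illustration).
sizes = {}
for share in (0.1, 0.2, 0.5):
    idx = random_undersample(labels, share)
    sizes[share] = len(idx)
```

In a full experiment, a classifier would be trained on each resampled training set and scored with `precision_recall_f1` on an untouched, still imbalanced test set, so that the reported metrics reflect the real class distribution.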