Ensemble learning models for the prediction of the weekly peak of PM2.5 concentration in Algiers, Algeria

Sabri Ghazi; Ahmed Dib; Mohamed Said Mehdi Mendjel; Tarek Khadir; Julie Dugdale

doi:10.18502/japh.v8i3.13783

Sabri Ghazi: Electronic Document Management Laboratory (LabGED), Department of Computer Science, University Badji Mokhtar, Annaba, Algeria
Ahmed Dib: System and Networking Laboratory (LRS), Department of Computer Science, University Badji Mokhtar, Annaba, Algeria
Mohamed Said Mehdi Mendjel: Electronic Document Management Laboratory (LabGED), Department of Computer Science, University Badji Mokhtar, Annaba, Algeria
Tarek Khadir: Electronic Document Management Laboratory (LabGED), Department of Computer Science, University Badji Mokhtar, Annaba, Algeria
Julie Dugdale: University Grenoble Alpes, Grenoble Informatics Laboratory (LIG), France

Keywords: Particulate matter with an aerodynamic diameter of less than 2.5 µm (PM2.5); Air pollution; Ensemble learning; Time series forecasting; Air pollution prediction

Abstract

Introduction: This paper focuses on predicting the weekly peak level of particulate matter with an aerodynamic diameter of less than 2.5 µm (PM2.5) using various Machine Learning (ML) models. The study compares ML models with deep learning models and emphasizes the explainability of ML models for PM2.5 prediction.

Materials and methods: We examine different combinations of features and time-window sizes to evaluate the performance of ML models. The study uses Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Decision Tree (DT), and five Ensemble Learning (EL) models: AdaBoost, XGBoost, LightGBM, CatBoost, and Random Forest (RF). The dataset comprises three years of daily measurements of weather parameters and PM2.5.
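To illustrate the kind of lagged-feature window described above, the following is a minimal sketch rather than the authors' pipeline. It assumes the weekly peak is the maximum daily PM2.5 value over the following seven days, a definition the abstract does not state. The function name `weekly_peak_dataset`, its parameters, and the sample readings are all hypothetical.

```python
from typing import List, Tuple


def weekly_peak_dataset(
    pm25: List[float], window: int = 7, horizon: int = 7
) -> Tuple[List[List[float]], List[float]]:
    """Build (features, target) pairs from a daily PM2.5 series.

    features: the `window` most recent daily values (lagged PM2.5).
    target:   the maximum over the following `horizon` days
              (ASSUMED definition of the weekly peak).
    """
    features, targets = [], []
    # t is the first day of the forecast horizon.
    for t in range(window, len(pm25) - horizon + 1):
        features.append(pm25[t - window:t])
        targets.append(max(pm25[t:t + horizon]))
    return features, targets


# Hypothetical daily readings in µg/m³, invented for illustration only.
series = [12.0, 15.0, 11.0, 18.0, 22.0, 9.0, 8.0,
          13.0, 16.0, 12.0, 19.0, 25.0, 10.0, 9.0, 14.0]
X, y = weekly_peak_dataset(series, window=7, horizon=7)
```

Varying `window` (for example 7, 14, or 21) is one way to reproduce the comparison of time-window sizes; any of the listed regressors could then be fitted on `X` and `y`.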

Results: Lagged values of PM2.5 improve prediction performance, particularly when the lag window spans seven days or a multiple of seven. This confirms that road traffic, which exhibits weekly seasonality, is the primary source of PM2.5 in Algiers. Interestingly, including lagged values of weather parameters decreases prediction performance, even when they are chosen based on their correlation with PM2.5. The AdaBoost model performs best, achieving a Root Mean Squared Error (RMSE) of 2.899 µg/m³ and an R² value of 0.96.
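For readers unfamiliar with the two reported metrics, the sketch below computes RMSE and R² from their standard definitions. The helper names and the sample values are hypothetical and unrelated to the study's data; the code assumes the observed values are not all identical, since R² is undefined in that case.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same unit as the target (µg/m³)."""
    squared_errors = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted weekly peaks, invented for illustration.
observed = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]
example_rmse = rmse(observed, predicted)
example_r2 = r2_score(observed, predicted)
```

RMSE is expressed in the unit of the target, so it reads directly as a typical error in µg/m³, whereas R² is unitless and measures the share of variance in the observed peaks that the predictions explain.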

Conclusion: EL models, specifically AdaBoost, exhibit strong performance in predicting PM2.5 levels. They not only provide accurate predictions but also allow analysis of feature importance. Lagged values of PM2.5 have a greater impact on predictions than weather parameters, and, surprisingly, including lagged weather parameters reduces prediction performance. EL models therefore offer valuable insight into feature significance.