Predicting Diabetes Risk Using Machine Learning: A Comparative Study on the Yazd Health Study (YaHS)

  • Fateme Sefid Department of Molecular Medicine, School of Advanced Technologies in Medicine, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  • Nazanin Norouzi-Ghahjavarestani Department of Computer Science, Yazd University, Yazd, Iran.
  • Malihe Soleymani-Tabasi Department of Computer Science, Yazd University, Yazd, Iran.
  • Jamal Zarepour-Ahmadabadi Department of Computer Science, Yazd University, Yazd, Iran.
  • Ghasem Azamirad Department of Mechanical Engineering, Yazd University, Yazd, Iran
  • Mohamah Yahya Vahidi Diabetes Research Center, Non-communicable Diseases Research Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  • Masoud Mirzaei Yazd Cardiovascular Research Centre, Non-Communicable Diseases Research Centre, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  • Seyed Mehdi Kalantar Abortion Research Centre, Yazd Reproductive Sciences Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
Keywords: Machine learning, Diabetes risk prediction, Yazd Health Study (YaHS), Random forest

Abstract

Diabetes is a chronic disease with a substantial global health burden, which makes accurate early risk prediction important for prevention and management. This study evaluates five machine learning algorithms, Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Naïve Bayes (NB), and Decision Tree (DT), for diabetes risk prediction using a dataset from the Yazd Health Study (YaHS). Preprocessing included data cleaning, class imbalance handling with the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors (SMOTEENN), and feature selection, all applied to improve model performance. Among the evaluated algorithms, the Random Forest classifier performed best, with an accuracy of 97%, outperforming the other methods. The findings highlight the importance of effective data preprocessing and algorithm selection in developing reliable predictive models from healthcare datasets.
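
The workflow summarized in the abstract (resampling, feature selection, then a comparison of five classifiers) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the synthetic data from `make_classification` stands in for the YaHS dataset, and the function name `compare_classifiers`, the `SelectKBest` feature selector, the scaling step, the 75/25 split, and all hyperparameters are assumptions. SMOTEENN comes from the imbalanced-learn package and is applied only if that package is installed. Accuracies printed by this sketch say nothing about the results reported in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

try:
    # Optional dependency (imbalanced-learn); resampling is skipped if missing.
    from imblearn.combine import SMOTEENN
except ImportError:
    SMOTEENN = None


def compare_classifiers(X, y, k_features=10, seed=0):
    """Hypothetical helper: fit five classifiers and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Resample the training split only, so the test split keeps the
    # original class distribution and is never used to synthesize samples.
    if SMOTEENN is not None:
        X_train, y_train = SMOTEENN(random_state=seed).fit_resample(X_train, y_train)

    models = {
        "SVM": SVC(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
        "NB": GaussianNB(),
        "DT": DecisionTreeClassifier(random_state=seed),
    }

    scores = {}
    for name, clf in models.items():
        # Scaling and univariate feature selection are assumed choices;
        # the abstract does not state which feature selection method was used.
        pipe = make_pipeline(
            StandardScaler(), SelectKBest(f_classif, k=k_features), clf
        )
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Synthetic, imbalanced stand-in data (roughly 15% positive class).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    weights=[0.85, 0.15], random_state=0,
)
scores = compare_classifiers(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Resampling only the training split is a deliberate choice in this sketch: evaluating on resampled data would tend to inflate accuracy. On an imbalanced dataset, accuracy alone can also be misleading, so metrics such as recall or F1 on the minority class are worth reporting alongside it.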

Published
2025-07-28