Predicting Diabetes Risk Using Machine Learning: A Comparative Study on the Yazd Health Study (YaHS)

Fateme Sefid; Nazanin Norouzi-Ghahjavarestani; Malihe Soleymani-Tabasi; Jamal Zarepour-Ahmadabadi; Ghasem Azamirad; Mohamah yahya Vahidi Mehrjardi; Masoud Mirzaei; Seyed Mehdi Kalantar

doi:10.18502/ijdo.v17i3.19267

Fateme Sefid Department of Molecular Medicine, School of Advanced Technologies in Medicine, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
Nazanin Norouzi-Ghahjavarestani Department of Computer Science, Yazd University, Yazd, Iran
Malihe Soleymani-Tabasi Department of Computer Science, Yazd University, Yazd, Iran
Jamal Zarepour-Ahmadabadi Department of Computer Science, Yazd University, Yazd, Iran
Ghasem Azamirad Department of Mechanical Engineering, Yazd University, Yazd, Iran
Mohamah yahya Vahidi Mehrjardi Diabetes Research Center, Non-communicable Diseases Research Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
Masoud Mirzaei Yazd Cardiovascular Research Centre, Non-Communicable Diseases Research Centre, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
Seyed Mehdi Kalantar Abortion Research Centre, Yazd Reproductive Sciences Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran

Keywords: Machine learning, Diabetes risk prediction, Yazd Health Study (YaHS), Random forest

Abstract

Diabetes is a chronic disease with a major impact on health worldwide, which makes accurate early risk prediction important for prevention and management. This study evaluates five machine learning algorithms, Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Naïve Bayes (NB), and Decision Tree (DT), for diabetes risk prediction using a dataset from the Yazd Health Study (YaHS). Extensive preprocessing, including data cleaning, class imbalance handling with the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors (SMOTEENN), and feature selection, is applied to improve model performance. Among the evaluated algorithms, the Random Forest classifier achieved the highest performance, with an accuracy of 97%. The findings highlight the importance of effective data preprocessing and algorithm selection in developing reliable predictive models from healthcare datasets.
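
The SMOTEENN step named in the abstract can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the toy two-dimensional data, the function names (nearest, smote, edited_nearest_neighbours), the neighbour counts (k=5 for SMOTE, k=3 for ENN), and the majority-vote removal rule are all assumptions made for illustration. SMOTE creates synthetic minority samples by interpolating between a minority point and one of its minority-class neighbours; ENN then removes samples whose nearest neighbours mostly carry a different label.

```python
import math
import random
from collections import Counter


def nearest(point, pool, k):
    """Return indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, k=5, rng=None):
    """Oversample `minority` until it matches the largest class.

    Assumes at least two distinct minority samples are present.
    """
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = max(Counter(y).values()) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_pts)
        # The closest point in the ranking is the base itself, so skip it.
        neighbours = nearest(base, minority_pts, k + 1)[1:]
        other = minority_pts[rng.choice(neighbours)]
        u = rng.random()  # interpolation weight in [0, 1)
        X_new.append(tuple(b + u * (o - b) for b, o in zip(base, other)))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose k nearest neighbours mostly share their label."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        idx = nearest(x, [X[j] for j in others], k)
        votes = Counter(y[others[j]] for j in idx)
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 60 majority samples (label 0), 12 minority (label 1).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(12)]
y = [0] * 60 + [1] * 12

X_over, y_over = smote(X, y, minority=1, rng=rng)          # balance the classes
X_res, y_res = edited_nearest_neighbours(X_over, y_over)   # clean noisy samples

print("original:   ", Counter(y))
print("after SMOTE:", Counter(y_over))
print("after ENN:  ", Counter(y_res))
```

In practice a maintained implementation such as the SMOTEENN class in imbalanced-learn's imblearn.combine module would normally be preferred; whether the study used that library is not stated in the abstract. Resampling of this kind is usually applied to the training split only, so that reported accuracy is measured on untouched data.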