Machine Learning Model to Diagnose Diabetes Type 2 Based on Health Behavior


GAZI UNIVERSITY JOURNAL OF SCIENCE, vol.35, no.3, pp.834-852, 2022 (ESCI)

  • Publication Type: Article
  • Volume: 35 Issue: 3
  • Publication Date: 2022
  • Doi Number: 10.35378/gujs.931760
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, Academic Search Premier, Aerospace Database, Aquatic Science & Fisheries Abstracts (ASFA), Communication Abstracts, Compendex, Metadex, Civil Engineering Abstracts, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.834-852
  • Keywords: Artificial intelligence, Diabetes, Health behavior, Gradient boosting, ANN, LIFE-STYLE INTERVENTIONS, PREVENT
  • Eskisehir Osmangazi University Affiliated: Yes


In 2016, diabetes was the seventh leading cause of death worldwide and the direct cause of 1.6 million deaths. In 2019, approximately 463 million adults (20-79 years) were living with diabetes, and that number is expected to rise to 700 million by 2045. Early diagnosis helps treat diabetes and prevent its complications, so an easy and fast way to diagnose it is crucial. In this study, we propose a method to diagnose diabetes with the help of machine learning algorithms and tools. The proposed method builds a model that predicts diabetes from the patient's health behavior, using the relationship between a healthy lifestyle and diabetes. Our goal is a reliable machine learning model that significantly eases and speeds up the diagnostic procedure. We used modern machine learning algorithms, namely XGBoost, LightGBM, CatBoost, and artificial neural networks, and obtained the dataset from the National Health and Nutrition Examination Survey (NHANES). In our study, the XGBoost algorithm performed best, with a 10-fold cross-validation score of 0.864 and an overall accuracy of 87.7% on the validation dataset and 84.96% on the test dataset.
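
The evaluation protocol named in the abstract (10-fold cross-validation followed by a check on held-out data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the study's code: the data are synthetic rather than NHANES, scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the split ratio, hyperparameters, and the choice of accuracy as the cross-validation metric are placeholders, since the abstract does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption): NOT the NHANES health-behavior data.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out a test set; the 80/20 ratio is an illustrative assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Gradient boosting stand-in for XGBoost (assumption); hyperparameters are placeholders.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# 10-fold stratified cross-validation on the training portion.
# Accuracy is assumed here; the abstract does not name the CV metric.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Refit on the full training portion and score once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Mean 10-fold CV accuracy: {cv_scores.mean():.3f}")
print(f"Held-out test accuracy:   {test_accuracy:.3f}")
```

Reporting the cross-validation mean alongside a separate held-out score, as the abstract does, helps show whether the model generalizes beyond the folds it was tuned on. The numbers this sketch prints come from synthetic data and are not comparable to the results reported in the study.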