Predicting Diabetes Using Machine Learning Approaches

Ratna Rathaur, Shivam Pandey, Ashutosh Mani*

Department of Biotechnology, Motilal Nehru National Institute of Technology Allahabad, India
*Corresponding author: amani@mnnit.ac.in

Received: 12 Nov 2025 | Accepted: 22 Dec 2025 | Published: 26 Dec 2025

Abstract

Diabetes mellitus is a chronic metabolic disorder and a major public health concern. This study applied machine learning techniques to predict early diabetes risk using the Pima Indian Diabetes Dataset. Five supervised models were evaluated: Logistic Regression, Decision Tree, Naive Bayes, Support Vector Machine, and Random Forest. Random Forest performed best in terms of accuracy (81.6%), precision, recall, and ROC-AUC. The results demonstrate the potential of machine learning-based approaches for early diagnosis and risk prediction of diabetes.

Keywords

Diabetes mellitus, Machine learning, Random Forest, Risk prediction, Integrative biology

Introduction

Diabetes mellitus is one of the most prevalent non-communicable diseases worldwide, affecting hundreds of millions of people. Early detection is crucial to prevent complications such as cardiovascular disease and neuropathy. Traditional diagnostic methods, though effective, are often invasive and resource-intensive. Machine learning offers a scalable alternative for early risk prediction from routinely collected clinical data.

Materials and Methods

The study used the Pima Indian Diabetes Dataset, which consists of 768 samples, each with eight clinical features and a binary diabetes outcome. Data preprocessing included handling missing values and normalizing the features. Five supervised machine learning models were implemented and evaluated using standard performance metrics: accuracy, precision, recall, F1-score, and ROC-AUC.
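The evaluation metrics named above can be illustrated with a minimal, standard-library sketch. It is not the study's code: the function names, the 0.5 decision threshold, and the eight example labels and scores are assumptions chosen only to show how each metric is derived from the confusion counts, and how ROC-AUC can be read as the probability that a positive case is scored above a negative one.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Illustrative values only (hypothetical, not taken from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Accuracy, precision, recall and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the raw scores and so summarizes ranking quality across all thresholds.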

Results

Among the five models, Random Forest achieved the highest accuracy (81.6%) and also performed best on the other evaluation metrics. Feature importance analysis identified plasma glucose level as the most significant predictor, followed by BMI and age.
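A feature importance ranking of this kind can be sketched with scikit-learn. The text does not state which importance measure was used, so the impurity-based `feature_importances_` attribute of `RandomForestClassifier` is an assumption here. The data below are synthetic and constructed so that the hypothetical `glucose` column dominates the label; they are not the study's data and do not reproduce its results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): three hypothetical features whose
# names are borrowed from the text purely for readability.
rng = np.random.default_rng(0)
n_samples = 500
glucose = rng.normal(120, 30, n_samples)
bmi = rng.normal(32, 7, n_samples)
age = rng.normal(35, 10, n_samples)

# By construction, the label depends mostly on glucose, less on BMI and age.
risk = (
    0.05 * (glucose - 120)
    + 0.05 * (bmi - 32)
    + 0.01 * (age - 35)
    + rng.normal(0, 1, n_samples)
)
y = (risk > 0).astype(int)
X = np.column_stack([glucose, bmi, age])
feature_names = ["glucose", "bmi", "age"]

# Fit a Random Forest; hyperparameters are illustrative, not the study's.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Impurity-based importances can favor features with many distinct values, so permutation importance on held-out data is a common cross-check.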

Discussion

The results highlight the effectiveness of ensemble-based machine learning techniques in disease prediction. Random Forest outperformed the other models, likely because averaging over many decision trees makes it robust to noise and able to capture nonlinear interactions among features. These findings support the integration of machine learning tools in clinical decision-making.

Conclusion

Machine learning models, particularly Random Forest, provide an effective approach for early diabetes risk prediction. Future work should focus on expanding the datasets and integrating the models into real-world clinical applications.

