Artificial Intelligence in Health, vol. 2, no. 4, pp. 47-74, 2025 (Scopus)
Given the obesity epidemic and its associated comorbidities, early detection of at-risk individuals is critical. Artificial intelligence and machine learning techniques offer substantial potential for automating obesity risk assessment, enabling early diagnosis and intervention. However, the development of robust predictive models is often hampered by limited or imbalanced datasets. Synthetic data generation has emerged as a key solution, allowing datasets to be expanded and balanced while preserving privacy. Recent surveys identify the synthetic minority oversampling technique (SMOTE) as a leading method for data generation in obesity detection. Accordingly, we analyzed the Estimation of Obesity Levels dataset from the University of California, Irvine (UCI) Machine Learning Repository, which focuses on dietary habits and physical condition and suffers from class imbalance. We compared three synthetic data generation approaches: SMOTE for nominal and continuous features (SMOTE-NC), variational autoencoders (VAEs), and the conditional tabular generative adversarial network (CTGAN). We trained multiple classifiers on the generated datasets and evaluated their performance. Classifiers trained on data including height and weight (i.e., body mass index [BMI]-related features) achieved F1-scores of up to 98.16%, as expected given that BMI, which is computed from height and weight, plays a direct role in obesity classification. Crucially, models trained without height and weight still achieved an F1-score of 74.48% when synthetic augmentation was used, demonstrating that useful obesity prediction models can be developed even in the absence of explicit anthropometric measures. These results indicate that synthetic data can enable accurate classification when key features are missing or when data are scarce.
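To make the oversampling idea concrete, the following is a minimal, simplified sketch of SMOTE-style interpolation for mixed nominal and continuous features. It is not the implementation used in the study, which the abstract does not describe. The function name `smote_nc_sketch`, the column layout, and the toy rows are all hypothetical and exist only for illustration. A maintained implementation is available as `SMOTENC` in the imbalanced-learn library.

```python
import random
import statistics
from collections import Counter


def smote_nc_sketch(rows, cont_idx, cat_idx, n_new, k=3, seed=0):
    """Create n_new synthetic rows from the rows of ONE minority class.

    rows     : list of lists, all belonging to the same class
    cont_idx : positions of continuous columns
    cat_idx  : positions of nominal (categorical) columns
    """
    rng = random.Random(seed)

    # A categorical mismatch is penalised by the median of the standard
    # deviations of the continuous columns, as in the SMOTE-NC idea.
    stds = [statistics.pstdev([r[j] for r in rows]) for j in cont_idx]
    penalty = statistics.median(stds) if stds else 1.0

    def dist(a, b):
        d = sum((a[j] - b[j]) ** 2 for j in cont_idx)
        d += sum(penalty ** 2 for j in cat_idx if a[j] != b[j])
        return d ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # k nearest neighbours of the base row within the same class
        others = [r for r in rows if r is not base]
        neighbours = sorted(others, key=lambda r: dist(base, r))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)

        new = list(base)
        for j in cont_idx:
            # Continuous value: a point on the segment between base and mate
            new[j] = base[j] + gap * (mate[j] - base[j])
        for j in cat_idx:
            # Nominal value: most frequent category among the neighbours
            new[j] = Counter(r[j] for r in neighbours).most_common(1)[0][0]
        synthetic.append(new)
    return synthetic


# Hypothetical toy minority class: [age, meals per day, transport mode]
minority = [
    [21.0, 3.0, "walking"],
    [23.0, 3.0, "walking"],
    [22.0, 2.0, "bike"],
    [25.0, 3.0, "walking"],
]
new_rows = smote_nc_sketch(minority, cont_idx=[0, 1], cat_idx=[2], n_new=5)
```

Because each synthetic continuous value lies between two real samples of the same class, the generated rows stay inside the observed range of that class. Nominal values are copied from the neighbourhood rather than interpolated, which is the reason a plain SMOTE variant is not suitable for a dataset that mixes both feature types.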