Explainable CNN–Radiomics Fusion and Ensemble Learning for Multimodal Lesion Classification in Dental Radiographs


Can Z., Aydin E.

DIAGNOSTICS, vol.15, no.16, pp.1997, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 15 Issue: 16
  • Publication Date: 2025
  • DOI: 10.3390/diagnostics15161997
  • Journal Name: DIAGNOSTICS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, EMBASE, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.1997
  • Affiliated with Eskişehir Osmangazi Üniversitesi: Yes

Abstract

Background/Objectives: Clinicians routinely rely on periapical radiographs to identify root-end disease, but interpretation errors and inconsistent readings compromise diagnostic accuracy. We therefore developed an explainable, multimodal AI framework that (i) fuses two data modalities, deep CNN embeddings and radiomic texture descriptors extracted only from lesion-relevant pixels selected by Grad-CAM, and (ii) makes every prediction transparent through dual-layer explainability (pixel-level Grad-CAM heatmaps + feature-level SHAP values). Methods: A dataset of 2285 periapical radiographs was processed using six CNN architectures (EfficientNet-B1/B4/V2M/V2S, ResNet-50, Xception). For each image, a Grad-CAM heatmap generated from the penultimate layer of the CNN was thresholded to create a binary mask delineating the region most responsible for the network’s decision. Radiomic features (first-order, GLCM, GLRLM, GLDM, NGTDM, and shape2D) were then computed only within that mask, ensuring that handcrafted descriptors and learned embeddings referred to the same anatomic focus. The two feature streams were concatenated, optionally reduced by principal component analysis or SelectKBest, and fed to random forest or XGBoost classifiers; five-view test-time augmentation (TTA) was applied at inference. Pixel-level interpretability was provided by the original Grad-CAM, while SHAP quantified the contribution of each radiomic and deep feature to the final vote. Results: Raw CNNs achieved around 52% accuracy and AUC values near 0.60. Multimodal fusion raised performance substantially: the Xception + radiomics + random forest model achieved 95.4% accuracy and an AUC of 0.9867, and adding TTA increased these to 96.3% and 0.9917, respectively. The top ensemble, Xception and EfficientNet-V2S fusion vectors classified with XGBoost under five-view TTA, reached 97.16% accuracy and an AUC of 0.9914, with false-positive and false-negative rates of 4.6% and 0.9%, respectively. Grad-CAM heatmaps consistently highlighted periapical regions, while SHAP plots revealed that radiomic texture heterogeneity and high-level CNN features jointly contributed to correct classifications. Conclusions: By tightly integrating CNN embeddings, mask-targeted radiomics, and a two-tiered explainability stack (Grad-CAM + SHAP), the proposed system delivers state-of-the-art lesion detection with transparent decisions, addressing both accuracy and trust.
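
The following is a minimal illustrative sketch (not the authors' released code) of the mask-targeted radiomics step described in the Methods: a Grad-CAM heatmap from a CNN's last convolutional block is thresholded into a binary mask, and texture descriptors are then computed only inside that mask. The ResNet-50 backbone, the 0.5 threshold, and the first-order/GLCM descriptors from scikit-image are assumptions standing in for the paper's six backbones and PyRadiomics feature classes.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import models
from skimage.feature import graycomatrix, graycoprops

model = models.resnet50(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4  # last conv block, assumed analogue of the "penultimate layer"

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

def gradcam_mask(x, threshold=0.5):
    """Return a binary lesion-focus mask (H, W) from a Grad-CAM heatmap."""
    activations.clear(); gradients.clear()
    logits = model(x)
    logits[0, logits.argmax(dim=1).item()].backward()
    w = gradients["g"].mean(dim=(2, 3), keepdim=True)              # channel weights
    cam = F.relu((w * activations["a"]).sum(dim=1, keepdim=True)).detach()
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return (cam[0, 0] >= threshold).numpy()

def masked_texture_features(img_u8, mask):
    """First-order and GLCM descriptors restricted to the Grad-CAM mask
    (simplified stand-in for the paper's first-order/GLCM/GLRLM/... classes)."""
    pixels = img_u8[mask]
    ys, xs = np.where(mask)
    crop = img_u8[ys.min():ys.max() + 1, xs.min():xs.max() + 1]    # bounding box of the mask
    glcm = graycomatrix(crop, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.array([pixels.mean(), pixels.std(),
                     graycoprops(glcm, "contrast").mean(),
                     graycoprops(glcm, "homogeneity").mean()])

# toy usage on a random image in place of a real periapical radiograph
x = torch.rand(1, 3, 224, 224)
mask = gradcam_mask(x)
img_u8 = (x[0, 0].numpy() * 255).astype(np.uint8)
features = masked_texture_features(img_u8, mask) if mask.any() else np.zeros(4)
print(features)
```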
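A second sketch, under the same caveats, illustrates the fusion and explanation stage: deep embeddings and masked radiomic vectors are concatenated, optionally reduced with SelectKBest, classified with a random forest, and explained feature-by-feature with SHAP. The feature dimensions, k, the synthetic data, and the single-view scoring (the paper averages five TTA views) are placeholders.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_cnn, d_rad = 400, 1280, 90                  # assumed embedding / radiomic vector sizes
X = np.hstack([rng.normal(size=(n, d_cnn)),      # deep CNN embeddings
               rng.normal(size=(n, d_rad))])     # mask-targeted radiomic descriptors
y = rng.integers(0, 2, size=n)                   # lesion vs. no-lesion labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

selector = SelectKBest(f_classif, k=200).fit(X_tr, y_tr)        # optional reduction step
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(selector.transform(X_tr), y_tr)

# five-view TTA would average clf.predict_proba over augmented views;
# here only the original view is scored
proba = clf.predict_proba(selector.transform(X_te))[:, 1]

# feature-level explanation: SHAP values for each retained deep/radiomic feature
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(selector.transform(X_te[:10]))
print(proba[:5], np.shape(shap_values))
```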