IEEE Access, vol. 13, pp. 185802-185817, 2025 (SCI-Expanded, Scopus)
Neural speech synthesis now produces speech that can sound convincingly human, posing challenges for security and forensics. We propose a detector that fuses an interpretable 51-dimensional spectro-temporal vector (13 MFCCs, 13 ΔMFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz) with compact CNN embeddings (EfficientNet-B1/B4, EfficientNet-V2-S/M, Xception, ResNet-50). Evaluation spans two complementary datasets: a controlled ESOGU corpus (real vs. synthetic from CoquiTTS, DiffVC, FreeVC) and the public ASVspoof2021-LA benchmark (bonafide vs. spoof across 13 attack systems, A07-A19). Duration controls remove utterance-length cues, and interpretability analyses verify reliance on formant structure and spectral texture rather than recording quirks. On ESOGU, the 51-D vector alone achieves 100% binary accuracy and 99.65% three-class recognition; with fusion, EfficientNet-B1 reaches 100% Stage-1 and 99.75% Stage-2 accuracy. These perfect scores are confined to ESOGU under our protocol; performance on ASVspoof2021-LA is lower. On ASVspoof2021-LA, where codec and channel diversity make detection harder, fusion raises performance where it matters most: EfficientNet-V2-M attains 94.59% binary accuracy and 87.09% 13-way spoof attribution, and the bonafide-class F1 improves by approximately 0.019-0.035 over MFCC-only methods. Permutation-importance analysis highlights low-order MFCCs, ΔMFCC dynamics, and spectral-contrast bands as principal cues, and Grad-CAM visualizations corroborate attention to characteristic peak-valley structure. These results show that well-chosen, interpretable acoustics, combined with lightweight CNN representations, deliver robust and explainable synthetic-speech detection without resorting to ever-larger end-to-end models.
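The abstract specifies the composition of the 51-dimensional spectro-temporal vector but not the extraction toolchain. The sketch below is one plausible implementation, assuming the librosa library with its default frame parameters, a hypothetical 16 kHz sampling rate, and per-utterance time-averaging to obtain a fixed-length vector; none of these choices are confirmed by the paper.

```python
# Minimal sketch (not the authors' released code) of a 51-D spectro-temporal
# feature vector matching the composition in the abstract:
# 13 MFCC + 13 delta-MFCC + 12 chroma + 7 spectral-contrast + 6 tonnetz.
# Assumptions: librosa defaults, 16 kHz resampling, mean pooling over time.
import numpy as np
import librosa

def extract_51d_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a fixed-length 51-D descriptor for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, T)
    d_mfcc = librosa.feature.delta(mfcc)                       # (13, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # (7, T): 6 bands + 1
    tonnetz = librosa.feature.tonnetz(
        y=librosa.effects.harmonic(y), sr=sr)                  # (6, T)

    # Average each feature over time and concatenate into one 51-D vector.
    feats = [mfcc, d_mfcc, chroma, contrast, tonnetz]
    vector = np.concatenate([f.mean(axis=1) for f in feats])   # (51,)
    assert vector.shape == (51,)
    return vector
```

For the fusion stage described above, such a vector could simply be concatenated with a CNN embedding (e.g., the penultimate-layer output of an EfficientNet backbone) before the final classifier; the exact fusion layer used in the paper is not specified here and this pairing is an illustrative assumption.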