International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 77
Published: January 2026
Authors: Sayyada Sara Banu, Ratnadeep R. Deshmukh
DOI: 10.5120/ijca2026926227
Sayyada Sara Banu, Ratnadeep R. Deshmukh. Experimental Analysis of an Interactive MFCC + AHC Speaker Diarization Framework Across Multi-Domain Audio Conditions. International Journal of Computer Applications. 187, 77 (January 2026), 35-43. DOI=10.5120/ijca2026926227
@article{ 10.5120/ijca2026926227,
author = { Sayyada Sara Banu and Ratnadeep R. Deshmukh },
title = { Experimental Analysis of an Interactive MFCC + AHC Speaker Diarization Framework Across Multi-Domain Audio Conditions },
journal = { International Journal of Computer Applications },
year = { 2026 },
volume = { 187 },
number = { 77 },
pages = { 35-43 },
doi = { 10.5120/ijca2026926227 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Sayyada Sara Banu
%A Ratnadeep R. Deshmukh
%T Experimental Analysis of an Interactive MFCC + AHC Speaker Diarization Framework Across Multi-Domain Audio Conditions
%J International Journal of Computer Applications
%V 187
%N 77
%P 35-43
%R 10.5120/ijca2026926227
%I Foundation of Computer Science (FCS), NY, USA
Automatic Speaker Diarization (ASD), the task of determining “who spoke when”, is essential for transcription, conversational analytics, call-center monitoring, courtroom recordings, and multilingual human–computer interaction. Classical systems built on Mel-frequency cepstral coefficients (MFCCs), Gaussian mixture models (GMMs), and hierarchical clustering are interpretable but struggle under noisy, overlapping, and diverse audio conditions, while modern deep-learning approaches such as x-vectors, ECAPA-TDNN, and Wav2Vec 2.0 offer higher accuracy but lack transparency. This study evaluates a visualization-enhanced MFCC–GMM–AHC diarization framework across AMI, VoxCeleb, CALLHOME, Mozilla Common Voice, and a custom English–Hindi dataset. The system integrates adaptive voice activity detection (VAD), MFCC + Δ + Δ² features, GMM speaker modeling, agglomerative hierarchical clustering (AHC), and Viterbi re-segmentation, supported by rich diagnostic visualizations. Results show strong segmentation quality and speaker separability, with the diarization error rate (DER) improving from 12.8% for the MFCC–GMM baseline to 4.7% for Wav2Vec 2.0. The framework demonstrates robust, interpretable performance across multiple domains.
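The pipeline's core steps (windowed MFCC + Δ + Δ² feature extraction followed by AHC over window-level statistics) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the diarize function, the 1.0 s / 0.5 s window and hop sizes, and the use of mean-pooled features as window embeddings are assumptions, and the paper's full framework additionally includes adaptive VAD, GMM modeling, and Viterbi re-segmentation.

import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(wav_path, n_speakers=2, win_s=1.0, hop_s=0.5):
    """Assign a speaker label to each analysis window (hypothetical sketch)."""
    y, sr = librosa.load(wav_path, sr=16000)
    win, hop = int(win_s * sr), int(hop_s * sr)
    embeddings, times = [], []
    for start in range(0, max(1, len(y) - win), hop):
        frame = y[start:start + win]
        # 13 MFCCs plus first- and second-order deltas (MFCC + delta + delta-delta)
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13)
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)])
        # Mean-pool over time as a crude stand-in for per-window GMM statistics
        embeddings.append(feats.mean(axis=1))
        times.append(start / sr)
    # AHC over window features: cosine distance with average linkage
    # (scikit-learn >= 1.2 names this parameter metric; older releases use affinity)
    labels = AgglomerativeClustering(n_clusters=n_speakers,
                                     metric="cosine",
                                     linkage="average").fit_predict(np.array(embeddings))
    return list(zip(times, labels))

# Usage: print (window start time, speaker id) pairs for a 3-speaker recording
# for t, spk in diarize("meeting.wav", n_speakers=3):
#     print(f"{t:6.1f}s -> speaker {spk}")

A complete system would cluster only the speech regions returned by the VAD stage and refine boundaries with Viterbi re-segmentation; the fixed windowing above is the simplest stand-in for that machinery.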