Cross-Modal Coordination in Reading: an Interpretable Audio–Video Fusion Pipeline for Cognitive Screening

Authors: Tiziana Currieri, Francesco Prinzi, Antonio Perna, Maria Santina Ler, Laura Ferraro, Gennaro Cordasco, Anna Esposito and Salvatore Vitabile

Conference: IEEE CBMS 2026

Tags: cross-modal coordination, digital prevention, facial motion analysis, interpretable machine learning and SHAP

Abstract:

Short reading recordings provide a practical basis for cognitive screening and prevention. However, many speech-based systems either compress the signal into coarse global aggregates or rely on deep representations with limited transparency. Face-video cues may add complementary information, but they are often less robust when used alone. We propose an interpretable audio–video pipeline that explicitly models cross-modal coordination during reading. Audio is represented through eGeMAPS descriptors, while video is processed with MediaPipe FaceMesh to derive frame-level mouth-opening and head-tilt dynamics. An audio-energy proxy is time-aligned to the video stream to compute synchrony descriptors (Pearson correlation and cross-correlation lag) and to detect misalignment episodes, i.e., intervals in which acoustic energy and mouth activity provide discordant evidence of speech activity. Subject-level representations summarise robust statistics, temporal-block dynamics, and the ratios, counts, and durations of misalignment episodes. We evaluate audio-only, video-only, and multimodal fusion models (early feature concatenation and late decision fusion) under leakage-safe, subject-level 5-fold stratified cross-validation. Early fusion with a Random Forest achieves the highest overall accuracy (ACC=0.752), while late fusion provides the most favourable class-balanced trade-off (BACC=0.678; AUROC up to 0.739), indicating that the two modalities carry complementary information. Overall, explicit and clinically readable coordination descriptors support an effective and interpretable multimodal approach to reading-based cognitive screening and prevention.
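The synchrony and misalignment descriptors named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the ±0.5 s lag search window, the lag sign convention (positive means mouth motion follows audio energy), and the assumption that both signals have already been resampled to the video frame rate and binarised into speech-activity flags are all hypothetical choices made for illustration.

```python
import numpy as np

def synchrony_descriptors(energy, mouth, fps, max_lag_s=0.5):
    """Pearson correlation and best cross-correlation lag (seconds).

    Assumes `energy` and `mouth` are 1-D series of equal length sampled
    at `fps` frames per second. `max_lag_s` is an illustrative choice.
    """
    e = np.asarray(energy, dtype=float)
    m = np.asarray(mouth, dtype=float)
    # z-score both series so the mean product equals a correlation
    e = (e - e.mean()) / (e.std() + 1e-12)
    m = (m - m.mean()) / (m.std() + 1e-12)
    pearson = float(np.mean(e * m))

    max_lag = int(round(max_lag_s * fps))
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = []
    for k in lags:
        if k >= 0:   # compare energy[t] with mouth[t + k]
            a, b = e[:len(e) - k], m[k:]
        else:        # compare energy[t] with mouth[t - |k|]
            a, b = e[-k:], m[:len(m) + k]
        xcorr.append(np.mean(a * b))
    best_lag = int(lags[int(np.argmax(xcorr))])
    return pearson, best_lag / fps

def misalignment_episodes(audio_active, mouth_active, fps, min_frames=1):
    """Ratio, count and mean duration of discordant runs.

    Inputs are per-frame boolean speech-activity flags; how they are
    thresholded is left open here (an assumption, not from the paper).
    """
    a = np.asarray(audio_active, dtype=bool)
    m = np.asarray(mouth_active, dtype=bool)
    discordant = a != m
    # pad with False so every run has a detectable start and end edge
    padded = np.concatenate(([False], discordant, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    lengths = ends - starts
    lengths = lengths[lengths >= min_frames]
    return {
        "ratio": float(lengths.sum() / len(a)),
        "count": int(len(lengths)),
        "mean_duration_s": float(lengths.mean() / fps) if len(lengths) else 0.0,
    }
```

Per-recording outputs of this kind would then be aggregated into the subject-level feature vector alongside the eGeMAPS and facial-motion statistics. Keeping each descriptor as a named scalar is what allows a tree model and SHAP to attribute predictions to readable quantities such as lag or misalignment ratio.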