Multimodal AI for Medical Image Classification: a Comprehensive Analysis of Image and Text Contributions

Title: Multimodal AI for Medical Image Classification: a Comprehensive Analysis of Image and Text Contributions

Authors: Semanto Mondal, Antonino Ferraro, Martina Iammarino, Fabiano Pecorelli and Giuseppe De Pietro

Conference: IEEE CBMS 2026

Tags: BERT, Feature Fusion, Medical Image Classification, Multimodal Learning and ResNet50

Abstract:

Multimodal learning has shown promise in medical image analysis by integrating visual and textual information. In this study, we systematically analyze the contributions of the image and text modalities to medical image classification using the MedPix 2.0 dataset. We construct a multimodal dataset that pairs 2050 clinical images with their corresponding textual descriptions and formulate a supervised classification task based on image location categories. We propose a deep learning framework that combines ResNet50 for image feature extraction with BERT for textual embeddings, followed by feature-level fusion via concatenation. We compare image-only, text-only, and multimodal models. The results show that multimodal integration improves classification performance, achieving an accuracy of 0.9220 and an F1-score of 0.9213, whereas text alone provides limited discriminative power. This study highlights the complementary role of textual information and offers insights into effective multimodal fusion strategies for medical imaging applications.
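
The feature-level fusion described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name ConcatFusionClassifier, the hidden size, the dropout rate, and the number of classes are hypothetical, and random tensors stand in for the encoder outputs. The 2048-dimensional image vector matches the pooled output of a standard ResNet50, and the 768-dimensional text vector assumes a BERT-base encoder, which the abstract does not specify.

```python
import torch
import torch.nn as nn


class ConcatFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate image and text vectors, then classify.

    All sizes are illustrative assumptions, not values from the paper.
    """

    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=5):
        super().__init__()
        # Small classification head applied to the fused feature vector.
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_features, text_features):
        # Concatenate along the feature axis: (batch, image_dim + text_dim).
        fused = torch.cat([image_features, text_features], dim=1)
        return self.head(fused)


torch.manual_seed(0)
batch_size = 4

# Random stand-ins for encoder outputs (assumption: in a real pipeline these
# would come from a ResNet50 backbone and a BERT text encoder).
image_features = torch.randn(batch_size, 2048)
text_features = torch.randn(batch_size, 768)

model = ConcatFusionClassifier()
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = model(image_features, text_features)

# One predicted class index per sample.
predictions = logits.argmax(dim=1)
```

Under the same assumptions, an image-only or text-only baseline like those compared in the study could be approximated by feeding a single feature vector to a head of matching input size, which keeps the classifier comparable across the three settings.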