A Multilingual Multimodal Medical Examination Dataset for Visual Question Answering in Healthcare
Tags: Health Informatics, Natural Language Processing, Vision-Language Models, Visual Question Answering
Abstract:
Vision-Language Models (VLMs) excel in multimodal tasks, yet their effectiveness in specialized medical applications remains underexplored. Accurate interpretation of medical images and text is crucial for clinical decision support, particularly in multiple-choice question answering (MCQA). To address the lack of benchmarks in this domain, we introduce the Multilingual Multimodal Medical Exam Dataset (MMMED), designed to assess VLMs' ability to integrate visual and textual information for medical reasoning. MMMED includes 582 MCQA pairs from Spanish medical residency exams (MIR), with multilingual support (Spanish, English, Italian) and paired medical images. We benchmark state-of-the-art VLMs, analyzing their strengths and limitations across languages and modalities. The dataset is publicly available on Hugging Face (https://huggingface.co/datasets/praiselab-picuslab/MMMED), with experimental code on GitHub (https://github.com/PRAISELab-PicusLab/MMMED).
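As a pointer for readers, the snippet below is a minimal sketch of loading MMMED from the Hugging Face Hub using the repository ID given above; the available configurations, split names, and field names are assumptions here and should be checked against the dataset card.

```python
# Minimal sketch: load MMMED from the Hugging Face Hub and inspect its schema.
# The repository ID comes from the dataset URL in the abstract; split and
# field names are assumptions and may differ from the actual dataset card.
from datasets import load_dataset

dataset = load_dataset("praiselab-picuslab/MMMED")

# Print the available splits and the keys of one example to see the schema
# (e.g. question text, answer options, image, language).
print(dataset)
first_split = next(iter(dataset))
print(dataset[first_split][0].keys())
```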