Evaluating a Multimodal Foundation Model for Glaucoma Classification from Fundus Images

Authors: Francesco Di Serio, Michela Gravina, Vincenzo Moscato, Carlo Sansone, Consuelo Gonzalo-Martin and Angel Garcia-Pedrero

Conference: IEEE CBMS 2026

Tags: Foundation Models, Fundus Photography, Glaucoma, LoRA, MedSigLIP, Vision-Language Models and Zero-Shot Learning

Abstract:

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus photography remains a major clinical challenge. Deep learning models achieve strong performance but require large labeled datasets and often fail to generalize. Multimodal foundation models offer a potential alternative, enabling zero-shot and few-shot adaptation through natural language prompting. In this work, we present the first evaluation of MedGemma, a recently released medical vision-language model, for fundus-based glaucoma detection. Using the standardized SMDG-19 benchmark, we conduct a comparative evaluation of three strategies: zero-shot prompting, parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA), and feature-based classification using embeddings extracted from the MedSigLIP vision encoder. Our results show that zero-shot inference achieves limited accuracy, highlighting the complexity of glaucoma detection in fundus images. In contrast, LoRA-based adaptation significantly improves performance, demonstrating the benefits of task-specific specialization. Notably, the best results are obtained when leveraging MedSigLIP visual embeddings within a dedicated classifier, suggesting that the intrinsic visual representations learned by the foundation model are highly discriminative for glaucoma screening. These findings highlight both the promise and the current limitations of foundation models for ophthalmic screening, underscoring the need for improved hybrid inference strategies and more effective multimodal integration.
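
The feature-based strategy described in the abstract (a dedicated classifier trained on frozen encoder embeddings) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the abstract does not say which classifier was used, so plain logistic regression stands in for it, and the embeddings are synthetic Gaussian vectors rather than real MedSigLIP features. The function names (`make_placeholder_embeddings`, `train_linear_probe`) are hypothetical and do not come from the paper.

```python
import math
import random


def make_placeholder_embeddings(n_per_class=100, dim=16, shift=1.0, seed=0):
    """Return synthetic stand-ins for image embeddings.

    ASSUMPTION: these are random vectors, NOT real MedSigLIP features.
    Label 1 plays the role of "glaucoma", label 0 of "non-glaucoma".
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label in (0, 1):
        center = shift if label == 1 else -shift
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, x):
    # Predicted probability of the positive class for one embedding.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_linear_probe(features, labels, lr=0.1, epochs=100):
    """Fit logistic regression on frozen embeddings (batch gradient descent).

    The encoder is never updated: only `weights` and `bias` are learned.
    """
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = score(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, features, labels):
    # Fraction of samples whose thresholded probability matches the label.
    hits = sum(
        (1 if score(weights, bias, x) >= 0.5 else 0) == y
        for x, y in zip(features, labels)
    )
    return hits / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_placeholder_embeddings(seed=0)
    test_x, test_y = make_placeholder_embeddings(seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out accuracy:", accuracy(w, b, test_x, test_y))
```

In a real pipeline the placeholder generator would be replaced by embeddings extracted once from the frozen vision encoder, and the hand-written training loop by a library classifier. The sketch only shows the division of labor: the foundation model supplies fixed features, and a small task-specific model is trained on top of them.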