Active-CLIP: Zero-Shot Active Learning with Visual Pseudo-Label Propagation for Efficient Med-VQA

Title: Active-CLIP: Zero-Shot Active Learning with Visual Pseudo-Label Propagation for Efficient Med-VQA

Authors: Willian Amorim, Gabriel Dias, Cid Santos and Priscila Saito

Conference: IEEE CBMS 2026

Tags: Active Learning, CLIP, Med-VQA and Pseudo-Labeling

Abstract:

Background and Objective: Medical Visual Question Answering (Med-VQA) demands accurate multimodal interpretation of images and clinical questions, but is hindered by data scarcity and high annotation costs. Vision-Language Models (VLMs) such as CLIP offer strong zero-shot capabilities, yet they suffer from poor calibration and hallucinations in medical domains. This study introduces Active-CLIP, a hybrid framework that combines zero-shot active learning with semi-supervised pseudo-label propagation to maximize annotation efficiency and performance in low-data regimes without fine-tuning the foundation model.

Methods: Using a frozen CLIP (ViT-B/32) as a zero-shot feature extractor and uncertainty estimator, Active-CLIP selects informative samples for annotation via Shannon entropy and propagates high-confidence pseudo-labels to the remaining pool using visual similarity in the CLIP embedding space. A multimodal transformer is trained from scratch on the enriched set and evaluated on four public Med-VQA datasets (VQA-RAD, Path-VQA, Omni-Med-VQA-Mini, SLAKE) across labeling budgets of 10%–90%, and compared against a baseline trained on real labels only.

Results: Active-CLIP consistently outperforms the baseline, with the largest gains at low labeling ratios (10%–30%): up to +13% in BLEU-4 (Path-VQA) and +41% in ROUGE-L (SLAKE), alongside improved semantic alignment (higher AS, lower AE).

Conclusions: Active-CLIP offers a promising, annotation-efficient solution for Med-VQA in data-scarce settings by combining zero-shot exploration with reliable pseudo-label exploitation.
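The selection and propagation steps described under Methods can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact propagation rule, so the nearest-labeled-neighbor scheme, the cosine-similarity `threshold`, and all function names below are assumptions made for illustration. The inputs `probs` (zero-shot class probabilities) and `embeddings` (image features) are assumed to have been computed beforehand with a frozen CLIP model.

```python
import numpy as np


def shannon_entropy(probs, eps=1e-12):
    """Row-wise Shannon entropy (in nats) of a (n_samples, n_classes) array."""
    p = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def select_for_annotation(probs, budget):
    """Return indices of the `budget` samples with the highest entropy."""
    entropy = shannon_entropy(probs)
    return np.argsort(-entropy)[:budget]


def propagate_pseudo_labels(embeddings, labeled_idx, labels, threshold=0.9):
    """Assumed propagation rule: copy the label of the most similar labeled
    sample (cosine similarity) when that similarity reaches `threshold`.

    `labels[k]` is the annotation of sample `labeled_idx[k]`.
    Returns a dict mapping unlabeled sample index -> pseudo-label.
    """
    # L2-normalize so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(len(z)), labeled_idx)

    sims = z[unlabeled_idx] @ z[labeled_idx].T  # (n_unlabeled, n_labeled)
    nearest = sims.argmax(axis=1)               # closest labeled sample
    best = sims.max(axis=1)                     # its similarity score

    return {
        int(i): labels[int(j)]
        for i, j, s in zip(unlabeled_idx, nearest, best)
        if s >= threshold                       # keep only confident matches
    }
```

In this sketch, the samples returned by `select_for_annotation` would be sent to annotators, and `propagate_pseudo_labels` would then extend those annotations to visually similar samples in the remaining pool. Raising `threshold` trades pseudo-label coverage for reliability.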