Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models
Tags: Counting, Medical Imaging, Multimodal AI, Object Detection, Vision-Language Models
Abstract:
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Using MedMultiPoints, a multimodal dataset combining annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate these tasks as instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves accuracy and robustness, for example by reducing Count Mean Absolute Error (MAE) and increasing Matching Accuracy in the Counting + Pointing task. However, we also observe trade-offs, such as an increase in zero-case point predictions despite improved average accuracy, indicating reduced reliability on edge cases in exchange for better performance on common samples. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning, achieving improved object counting and localization performance when tasks are combined during training. The model retains interpretable, structured outputs, making it a promising step toward explainable, versatile medical AI. Code, model weights, and evaluation protocols will be released for reproducibility.
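Since the abstract's central technical claim is prompt-driven LoRA fine-tuning of Qwen2.5-VL-7B-Instruct on instruction-reformulated counting and pointing tasks, a minimal sketch of what such a setup could look like is given below. It assumes the Hugging Face transformers and peft libraries; the LoRA hyperparameters, target modules, prompt wording, and JSON output schema are illustrative assumptions, not the configuration released with the paper.

```python
# Minimal sketch (assumptions, not the authors' released code): wrap
# Qwen2.5-VL-7B-Instruct with LoRA adapters and build an instruction-style
# prompt that reformulates counting + pointing as a vision-language task.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Attach low-rank adapters to the attention projections of the language backbone.
lora_config = LoraConfig(
    r=16,                # assumed adapter rank
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Reformulate a Counting + Pointing query as an instruction-based prompt.
# The prompt text and structured output format below are hypothetical.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": (
                    "Count the polyps in this endoscopy image and answer with JSON "
                    "of the form {\"count\": N, \"points\": [[x1, y1], ...]}, "
                    "giving one (x, y) point per polyp."
                ),
            },
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

Keeping the target output as structured JSON mirrors the abstract's point that the fine-tuned model retains interpretable, structured outputs that can be scored directly with counting and localization metrics.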