Understanding Chest X-Ray Vision Representations with DeViL

Title: Understanding Chest X-Ray Vision Representations with DeViL

Authors: Isabel Rio-Torto, Bruno Coelho, Jaime Cardoso and Luís F. Teixeira

Conference: IEEE CBMS 2026

Tags: chest X-ray analysis, interpretability, open-vocabulary saliency, post-hoc explainability, saliency maps, structured radiology report generation

Abstract:

Explainability is essential for the reliable deployment of deep learning models in clinical settings, with most research focusing on post-hoc visual saliency methods. However, these methods rely on task-specific classifiers, which limits their applicability to analysing vision encoders independently of downstream tasks. Moreover, they can only explain the predefined set of classes for which the underlying model was trained. In this work, we investigate how DeViL, a framework that translates visual features into natural language, can be leveraged for understanding representations learned by chest X-ray vision models. DeViL requires no task-specific heads, since it only uses the frozen vision encoder, and it is able to generate open-vocabulary saliency maps because it uses a language model. We adapt DeViL to align visual features with clinically meaningful concepts by training on structured radiology reports and using a radiology-specialised language model. We conduct experiments on structured radiology report generation, saliency generation, and open-vocabulary saliency-text grounding on different types of chest X-ray vision encoders: convolutional, self-supervised Vision Transformer, and vision–language Transformer. Results show performance competitive with large end-to-end report generation models and demonstrate that DeViL's open-vocabulary saliency maps outperform those produced by a specialised saliency generation method for vision–language encoders.