Bridge Inspection Using A Multi-Modal Vision Language Model

11 pages • Published: August 28, 2025

Zhengxing Chen, Yang Zou, Vicente González, Jason Ingham and Liam Wotherspoon

Abstract

Using an Unmanned Aerial Vehicle (UAV) in bridge inspections can reduce human involvement in complex and hazardous inspection environments and automate the inspection process. Current practice still requires human operators to define task objectives, oversee safe flight operations, and evaluate bridge conditions. There is a growing demand for seamless collaboration between UAVs and human inspectors so that inspection tasks can be completed more efficiently and safely, especially in post-disaster scenarios where critical bridges and other infrastructure facilities need to be inspected within hours or days. A significant gap exists in enabling UAVs to intelligently perceive and understand the bridge inspection scene according to human instructions. To partially fill this gap, an intuitive human-UAV collaboration system using a multi-modal Vision Language Model (VLM) is proposed. The system leverages a few-shot Contrastive Language–Image Pretraining (CLIP)-based model to enable UAVs to visually and semantically understand the bridge inspection environment based on human commands. By incorporating text prompt learning with a cache adapter, the proposed model enhances the ability of CLIP to interpret both textual and visual inputs in the context of bridge inspection. The model was trained and evaluated on a bridge inspection image dataset and achieved an accuracy of 83.33%, outperforming other few-shot image classification methods and demonstrating its effectiveness in the bridge inspection domain. This approach is expected to improve collaboration between AI-empowered UAVs, inspectors, and bridge environments, thereby enhancing the overall efficiency of bridge inspections.
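The abstract does not spell out how the cache adapter combines with CLIP, so the following is only a minimal sketch of one common formulation (a Tip-Adapter-style key-value cache), which may differ from the authors' model. All names (`cache_adapter_logits`, `alpha`, `beta`, `scale`) are hypothetical, the feature arrays stand in for real CLIP image and text embeddings, and the text prompt learning step is not shown.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cache_adapter_logits(query, text_weights, cache_keys, cache_labels,
                         alpha=1.0, beta=5.5, scale=100.0):
    """Blend zero-shot text logits with few-shot cache logits.

    query:        (N, D) image features to classify (assumed CLIP embeddings)
    text_weights: (C, D) one text embedding per class prompt
    cache_keys:   (K, D) features of the few-shot training images
    cache_labels: (K, C) one-hot labels of the few-shot training images
    alpha, beta:  assumed hyperparameters (cache weight and sharpness)
    """
    query = l2_normalize(query)
    text_weights = l2_normalize(text_weights)
    cache_keys = l2_normalize(cache_keys)

    # Zero-shot branch: cosine similarity between image and class text features.
    zero_shot = scale * query @ text_weights.T            # (N, C)

    # Cache branch: similarity to stored few-shot examples, sharpened by beta,
    # then routed to classes through the one-hot labels.
    affinity = query @ cache_keys.T                       # (N, K)
    cache = np.exp(-beta * (1.0 - affinity)) @ cache_labels  # (N, C)

    return zero_shot + alpha * cache

# Toy usage with synthetic features (not real CLIP outputs).
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 8, 4
text = rng.normal(size=(num_classes, dim))
keys = rng.normal(size=(num_classes * shots, dim))
labels = np.eye(num_classes)[np.repeat(np.arange(num_classes), shots)]
images = rng.normal(size=(2, dim))
predicted = cache_adapter_logits(images, text, keys, labels).argmax(axis=1)
```

In this sketch the cache term only adjusts the zero-shot prediction, so setting `alpha=0` recovers plain zero-shot CLIP-style classification.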

Keyphrases: bridge inspection, human robot collaboration (hrc), natural language interaction, unmanned aerial vehicle (uav), vision language model (vlm)

In: Jack Cheng and Yu Yantao (editors). Proceedings of The Sixth International Conference on Civil and Building Engineering Informatics, vol 22, pages 578-588.

Links:	https://easychair.org/publications/paper/tr9R
	https://doi.org/10.29007/m7wj

BibTeX entry

@inproceedings{ICCBEI2025:Bridge_Inspection_Using_Multi,
  author    = {Zhengxing Chen and Yang Zou and Vicente González and Jason Ingham and Liam Wotherspoon},
  title     = {Bridge Inspection Using A Multi-Modal Vision Language Model},
  booktitle = {Proceedings of The Sixth International Conference on Civil and Building Engineering Informatics},
  editor    = {Jack Cheng and Yu Yantao},
  series    = {Kalpa Publications in Computing},
  volume    = {22},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2515-1762},
  url       = {https://easychair.org/publications/paper/tr9R},
  doi       = {10.29007/m7wj},
  pages     = {578--588},
  year      = {2025}}
