Generating Bangla Image Captions with Deep Learning Techniques

Tags: Bangla Image Captioning, BanglaView, EfficientNetB4, Flickr30k, ResNet-50
Abstract:
Bangla image captioning addresses the need to bridge visual content and textual descriptions in Bengali, enabling more accessible and inclusive technologies for Bengali-speaking users. Image captioning generates textual descriptions of images using deep learning techniques that combine computer vision and natural language processing to recognize and describe visual content. This study presents a methodology that employs the EfficientNetB4 and ResNet-50 architectures for feature extraction; these models were selected after a comprehensive evaluation of alternatives, in which they showed the strongest performance for this task. A key contribution of this work is the introduction of BanglaView, a new dataset designed specifically for Bangla captioning, used alongside the widely adopted Flickr30k dataset. Experimental results show that EfficientNetB4 outperforms ResNet-50, reaching a BLEU score of 0.54 after only 10 training epochs and demonstrating its effectiveness in generating coherent Bangla captions. Beyond the technical results, this approach has the potential to foster innovations that promote linguistic diversity and improve user experiences in areas such as accessibility and digital communication.
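As a rough illustration of the kind of pipeline the abstract describes, the sketch below loads a pretrained EfficientNetB4 backbone as an image feature encoder and scores a candidate Bangla caption against a reference with BLEU. The Keras and NLTK calls, the dummy images, and the sample Bangla tokens are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only (assumed Keras/NLTK usage, not the paper's code):
# extract image features with a pretrained EfficientNetB4 backbone and score a
# candidate Bangla caption against a reference caption with BLEU.
import numpy as np
from tensorflow.keras.applications import EfficientNetB4
from tensorflow.keras.applications.efficientnet import preprocess_input
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# EfficientNetB4 without its classification head; global average pooling gives
# one fixed-length feature vector per image for a caption decoder to consume.
encoder = EfficientNetB4(include_top=False, weights="imagenet", pooling="avg")

# Dummy batch of two images at EfficientNetB4's default 380x380 input size.
images = np.random.randint(0, 256, size=(2, 380, 380, 3)).astype("float32")
features = encoder.predict(preprocess_input(images))
print(features.shape)  # (2, 1792) -- feature vectors fed to a caption decoder

# BLEU between a reference caption and a generated one (hypothetical tokens).
reference = [["একটি", "কুকুর", "ঘাসের", "উপর", "দৌড়াচ্ছে"]]
candidate = ["একটি", "কুকুর", "দৌড়াচ্ছে"]
bleu = sentence_bleu(reference, candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.2f}")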