Pixnatte: A Hybrid BLIP–CLIP Framework for Grounded Image Caption Generation
Contributors
Khadija Slimani
Anjanadevi B
Keywords
Proceeding
Track
Engineering and Sciences
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Automated image captioning sits at the intersection of computer vision and natural language processing. Its core challenge is generating descriptions that are simultaneously fluent, contextually accurate, and semantically grounded. Captioning systems based solely on encoder-decoder architectures often produce plausible but factually imprecise captions, particularly for compositionally complex scenes. This paper presents Pixnatte, a hybrid system that integrates Bootstrapping Language-Image Pre-training (BLIP) with Contrastive Language-Image Pre-training (CLIP) to address this limitation. The proposed architecture uses CLIP's semantic embedding space to anchor caption generation within BLIP's encoder-decoder framework, with the aim of reducing hallucination and improving alignment with image content. Experiments on the MS COCO benchmark yield a CIDEr score of 1.6779, BLEU-1 of 0.4444, ROUGE-L of 0.4815, and METEOR of 0.2382, demonstrating competitive performance against conventional baselines. Pixnatte also supports Visual Question Answering (VQA), enabling natural language interaction with visual content. The results indicate that grounding caption generation through BLIP–CLIP integration improves caption quality on these metrics and holds practical potential for applications in digital media, accessibility, and intelligent content systems.
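The abstract does not specify how CLIP's embedding space anchors BLIP's output. One common way to combine the two models is to rerank candidate captions by their similarity to the image embedding, and the sketch below illustrates that idea only; it is an assumption, not necessarily Pixnatte's mechanism. The function names `cosine` and `rerank_captions` are hypothetical, and the three-dimensional vectors are toy values standing in for embeddings that a CLIP image encoder and text encoder would produce for an image and for captions proposed by a BLIP decoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rerank_captions(image_embedding, candidates):
    """Order (caption, text_embedding) pairs by similarity to the image.

    Returns a list of (caption, score) tuples, best match first.
    """
    scored = [(cosine(image_embedding, emb), cap) for cap, emb in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cap, score) for score, cap in scored]


# Toy stand-ins (assumed values): in a real pipeline the image vector would
# come from a CLIP image encoder and each caption vector from a CLIP text
# encoder applied to a candidate caption.
image_embedding = [0.9, 0.1, 0.3]
candidates = [
    ("a dog runs on the beach", [0.8, 0.2, 0.3]),
    ("a cat sleeps on a sofa", [0.1, 0.9, 0.2]),
    ("a dog sits indoors", [0.6, 0.5, 0.1]),
]

ranked = rerank_captions(image_embedding, candidates)
best_caption = ranked[0][0]
print(best_caption)
```

Under this scheme, a fluent caption that describes content absent from the image receives a low similarity score and is ranked below better-grounded candidates, which is one way a contrastive model could discourage hallucinated descriptions.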