Pixnatte: A Hybrid BLIP–CLIP Framework for Grounded Image Caption Generation

Khadija Slimani; Anjanadevi B

Date Published: 12 May 2026

Keywords

Image Captioning; BLIP; CLIP; Visual Question Answering; MS COCO; Multimodal Learning; Semantic Grounding

Proceeding

Vol. 1 No. 3 (2026): LGPR Batch 2 Conference 3

Track

Engineering and Sciences

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract

Automated image captioning sits at the intersection of computer vision and natural language processing; the core challenge is generating descriptions that are simultaneously fluent, contextually accurate, and semantically grounded. Captioning systems based solely on encoder-decoder architectures often produce plausible but factually imprecise captions, particularly for compositionally complex scenes. This paper presents Pixnatte, a hybrid system that integrates Bootstrapping Language-Image Pre-training (BLIP) with Contrastive Language-Image Pre-training (CLIP) to address this limitation. The proposed architecture uses CLIP's semantic embedding space to anchor caption generation within BLIP's encoder-decoder framework, reducing hallucination and improving alignment with image content. Experiments on the MS COCO benchmark yield a CIDEr score of 1.6779, BLEU-1 of 0.4444, ROUGE-L of 0.4815, and METEOR of 0.2382, which is competitive with conventional baselines. Pixnatte also supports Visual Question Answering (VQA), enabling natural language interaction with visual content. The results indicate that grounding caption generation through BLIP–CLIP integration produces measurably better captions and holds practical potential for applications in digital media, accessibility, and intelligent content systems.
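
The abstract does not spell out how CLIP's embedding space anchors BLIP's output. One common way to realize this kind of grounding, shown here only as an illustrative sketch and not as the authors' method, is to generate several candidate captions and rerank them by cosine similarity between the image embedding and each caption's text embedding. The function names (`cosine`, `rerank_captions`) and the toy three-dimensional vectors are hypothetical; in a real pipeline the embeddings would come from a CLIP image and text encoder and the candidates from a BLIP decoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_captions(image_emb, candidates):
    """Sort (caption, text_embedding) pairs by similarity to the image embedding.

    Returns a list of (caption, score) pairs, best match first.
    Assumption: image and text embeddings live in the same space, as in CLIP.
    """
    scored = [(cap, cosine(image_emb, emb)) for cap, emb in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model output).
image_emb = [0.9, 0.1, 0.0]
candidates = [
    ("a cat on a sofa", [0.1, 0.9, 0.2]),
    ("a dog on a beach", [0.8, 0.2, 0.1]),
]

ranked = rerank_captions(image_emb, candidates)
best_caption = ranked[0][0]
print(best_caption)
```

With these toy vectors the second candidate lies closer to the image embedding, so it is ranked first. Reranking leaves the caption generator untouched; tighter integrations, such as feeding CLIP features into the decoder, are also possible, and the abstract does not say which is used.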

How to Cite

Slimani, K., & B, A. (2026). Pixnatte: A Hybrid BLIP–CLIP Framework for Grounded Image Caption Generation. Sustainable Global Societies Initiative, 1(3). https://vectmag.com/sgsi/paper/view/560