Explainable Multimodal LLMs: Integrating Multi-Shot Reasoning for Transparent and Trustworthy AI


Date Published: 11 January 2026

Contributors

Aleem Ali

Lincoln University College
Author

Shashi Kant Gupta

Author

Keywords

Explainable Multimodal Large Language Models, Multi-Shot Multimodal Reasoning, Cross-Modal Explainability, Attention-Based Interpretability

Proceeding

Track

Engineering, Sciences, Mathematics & Computations

License

Copyright (c) 2026 Sustainable Global Societies Initiative


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract

Recent advances in multimodal large language models (MLLMs) have extended artificial intelligence to processing and reasoning across diverse modalities such as text, images, and video. However, the decision-making processes of these models remain largely opaque, limiting their deployment in critical, trust-sensitive domains. This paper introduces an explainability-driven extension of the Multi-Shot Multimodal Large Language Model (MS-MLLM), integrating interpretability modules to enable transparent and trustworthy multimodal reasoning. The proposed model combines cross-attention fusion, multi-shot contextual learning, and explainable visual-textual inference through attention-based and gradient-based interpretability mechanisms. Experiments on the MIMIC-CXR, MS COCO, and YouTube-8M benchmarks show that the framework maintains strong performance (89% accuracy in medical diagnosis, a CIDEr score of 112 for image captioning, and 82% accuracy in video question answering) while providing interpretable insights through attention heatmaps and textual rationales. The study underscores the need to integrate explainability into multi-shot multimodal learning to build human-aligned, transparent, and reliable AI systems for real-world applications.
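
The abstract describes the attention-based interpretability mechanism only at a high level. A minimal sketch of the heatmap idea it mentions, assuming a PyTorch cross-attention fusion block in which text tokens attend over image patches, could look like the following; the class and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of attention-based heatmap extraction for a multimodal model.
# All module and variable names are illustrative; they are not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Single cross-attention block: text queries attend over image patch features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor):
        # fused:   (B, T, D) text features enriched with visual context
        # weights: (B, T, P) attention from each text token to each image patch
        fused, weights = self.attn(
            text_tokens, image_patches, image_patches,
            need_weights=True, average_attn_weights=True,
        )
        return fused, weights


def attention_heatmap(weights: torch.Tensor, grid: int, image_size: int) -> torch.Tensor:
    """Turn token-to-patch attention into an image-sized saliency map.

    weights: (B, T, P) with P = grid * grid patches.
    Returns: (B, image_size, image_size), normalized to [0, 1] per image.
    """
    saliency = weights.mean(dim=1)                    # average over text tokens -> (B, P)
    saliency = saliency.view(-1, 1, grid, grid)       # back onto the 2D patch grid
    saliency = F.interpolate(saliency, size=image_size,
                             mode="bilinear", align_corners=False).squeeze(1)
    smin = saliency.amin(dim=(1, 2), keepdim=True)
    smax = saliency.amax(dim=(1, 2), keepdim=True)
    return (saliency - smin) / (smax - smin + 1e-8)   # min-max normalization


if __name__ == "__main__":
    B, T, P, D, grid, img = 2, 12, 196, 256, 14, 224  # 14x14 patch grid, 224px images
    fusion = CrossAttentionFusion(dim=D)
    text = torch.randn(B, T, D)                       # placeholder text token features
    patches = torch.randn(B, P, D)                    # placeholder image patch features
    _, attn = fusion(text, patches)
    heatmap = attention_heatmap(attn, grid=grid, image_size=img)
    print(heatmap.shape)                              # torch.Size([2, 224, 224])
```

In this reading, the heatmap is simply the cross-attention mass each image patch receives, upsampled to the input resolution; a gradient-based variant would instead weight patch features by the gradient of the predicted class score, as in Grad-CAM-style methods.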


How to Cite

Ali, A., & Gupta, S. K. (2026). Explainable Multimodal LLMs: Integrating Multi-Shot Reasoning for Transparent and Trustworthy AI. Sustainable Global Societies Initiative, 1(1). https://vectmag.com/sgsi/paper/view/66