Explainable Multimodal LLMs: Integrating Multi-Shot Reasoning for Transparent and Trustworthy AI
Contributors
Aleem Ali
Shashi Kant Gupta
Track
Engineering, Sciences, Mathematics & Computations
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Recent advances in multimodal large language models (MLLMs) have extended artificial intelligence to processing and reasoning across diverse modalities such as text, images, and video. However, the decision-making processes of these models remain largely opaque, limiting their deployment in critical, trust-sensitive domains. This paper introduces an explainability-driven extension of the Multi-Shot Multimodal Large Language Model (MS-MLLM) that integrates interpretability modules to enable transparent and trustworthy multimodal reasoning. The proposed model combines cross-attention fusion, multi-shot contextual learning, and explainable visual-textual inference through attention-based and gradient-based interpretability mechanisms. Experiments on the benchmark datasets MIMIC-CXR, MS COCO, and YouTube-8M demonstrate that the proposed framework maintains high performance (89% accuracy in medical diagnosis, a CIDEr score of 112 for image captioning, and 82% accuracy in video question answering) while offering interpretable insights via heatmaps and textual rationales. The study underscores the need to integrate explainability into multi-shot multimodal learning to ensure human-aligned, transparent, and reliable AI systems for real-world applications.
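To make the combination of cross-attention fusion and attention-based interpretability mentioned in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It assumes illustrative dimensions (512-dimensional features, a 14x14 image patch grid), an invented class name CrossAttentionFusion, and uses the standard torch.nn.MultiheadAttention module, whose returned attention weights can be reshaped into per-token heatmaps over the image.

```python
# Minimal sketch (assumptions only, not the paper's code): text tokens attend over
# image patches; the returned attention weights double as interpretability heatmaps.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: queries come from text, keys/values from image patches.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats: (batch, num_tokens, embed_dim)
        # image_feats: (batch, num_patches, embed_dim)
        fused, attn_weights = self.cross_attn(
            query=text_feats, key=image_feats, value=image_feats, need_weights=True
        )
        # attn_weights: (batch, num_tokens, num_patches); reshaping the patch axis to
        # the image grid yields a per-token attention heatmap over the input image.
        return self.norm(text_feats + fused), attn_weights


if __name__ == "__main__":
    model = CrossAttentionFusion()
    text = torch.randn(1, 16, 512)    # 16 text tokens (illustrative)
    image = torch.randn(1, 196, 512)  # 14 x 14 image patches (illustrative)
    fused, weights = model(text, image)
    heatmap = weights[0, 0].reshape(14, 14)  # heatmap for the first text token
    print(fused.shape, heatmap.shape)
```

The gradient-based interpretability mechanisms cited in the abstract (e.g., Grad-CAM-style attributions) would operate on intermediate activations rather than attention weights; this sketch only illustrates the attention-based pathway.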