A Review on Multimodal Product Clustering and Vision Transformer Based Product Recommender System
Contributors
Ssvr Kumar Addagarla
Dr. Upendra Kumar
Track
Engineering, Sciences, Mathematics & Computations
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
In modern e-commerce, recommender systems play a crucial role in helping users navigate large catalogs to find the most relevant products. Most prevailing recommendation approaches, from collaborative filtering to content-based and deep visual models, primarily capture coarse, category-level similarities and fail to recognize fine-grained product characteristics. Attributes such as material, design patterns, intended usage, and style compatibility are commonly missed, yielding recommendations that are visually similar but not functionally or contextually complementary. Many state-of-the-art models also act as black boxes, offering little insight into why a particular product is recommended. This review paper discusses recent work on multimodal product clustering and vision transformer–based recommendation systems, focusing on how visual and textual modalities can be jointly leveraged to capture fine-grained product semantics. The survey highlights the major techniques, strengths, and weaknesses of existing approaches, with an emphasis on explainability and context awareness in recommendation frameworks. By consolidating current advances and open challenges, this review seeks to provide clear grounds for future research into fine-grained, interpretable multimodal product recommendation.