A Review on Multimodal Product Clustering and Vision Transformer Based Product Recommender System


Date Published : 11 January 2026

Contributors

Ssvr Kumar Addagarla

Author

Dr. Upendra Kumar

Institute of Engineering & Technology, Lucknow
Author

Keywords

Multimodal recommendation; Product clustering; Vision transformers; Explainable AI; E-commerce

Proceeding

Track

Engineering, Sciences, Mathematics & Computations

License

Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract

In modern e-commerce, recommender systems play a crucial role in helping users navigate large catalogs and find the most relevant products. Most prevailing recommendation approaches, from collaborative filtering to content-based and deep visual models, capture coarse, category-level similarities but fail to recognize fine-grained product characteristics. Attributes such as material, design pattern, intended usage, and style compatibility are commonly missed, which leads to recommendations that are visually similar but not functionally or contextually complementary. Many state-of-the-art models also operate as black boxes, providing little insight into why a particular product is recommended. This review discusses recent work on multimodal product clustering and vision transformer-based recommendation systems, focusing on how visual and textual modalities can be jointly leveraged to capture fine-grained product semantics. The survey highlights the major techniques, strengths, and weaknesses of existing approaches, with an emphasis on explainability and context awareness in recommendation frameworks. By consolidating current advances and open challenges, this review seeks to provide a clear foundation for future research into fine-grained, interpretable multimodal product recommendation.

How to Cite

Addagarla, S. K., & Kumar, U. (2026). A Review on Multimodal Product Clustering and Vision Transformer Based Product Recommender System. Sustainable Global Societies Initiative, 1(2). https://vectmag.com/sgsi/paper/view/161