Skin Disease Detection using Hybrid CNN and Vision Transformer Architecture
Contributors
Bipin P R
Upendra Kumar
Sai Kiran Oruganti
Track
Engineering, Sciences, Mathematics & Computations
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Skin cancer is among the fastest-growing malignancies worldwide, and melanoma accounts for the majority of skin cancer deaths. Early detection substantially improves treatment outcomes and survival rates. Conventional visual diagnosis depends on the dermatologist's expertise, which may lead to variability in interpretation. In recent years, deep learning methods, particularly convolutional neural networks (CNNs), have shown strong potential for automating skin lesion classification. However, because convolutions operate on local receptive fields, CNNs often fail to capture global dependencies across an image. Vision Transformers (ViTs), by contrast, use self-attention to model long-range interactions between image patches.
This research proposes a hybrid model that integrates CNN and ViT architectures so that both local and global features contribute to classification. The CNN component performs preprocessing and local feature extraction, while the ViT module captures global context. Experiments on the ISIC 2019 dataset show that the hybrid model achieves higher classification accuracy than a standalone CNN or a standalone ViT. The proposed architecture offers a reliable approach to automated dermatological diagnosis in clinical and telemedicine settings.
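
The division of labor described in the abstract, a convolutional stage for local features followed by self-attention for global context, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the single convolution, the single attention head, the random untrained weights, and all function names (`conv2d_relu`, `self_attention`, `hybrid_forward`, `init_params`) are assumptions chosen for illustration. The eight output classes are likewise an assumption, picked to match the eight diagnostic categories of the ISIC 2019 training set.

```python
import numpy as np

rng = np.random.default_rng(0)
KERNEL, STRIDE = 3, 2  # illustrative convolution settings (assumed, not from the paper)

def conv2d_relu(image, kernels, stride=STRIDE):
    """Valid strided 2D convolution followed by ReLU.
    image: (H, W, C_in), kernels: (k, k, C_in, C_out) -> (H', W', C_out)."""
    k = kernels.shape[0]
    h_out = (image.shape[0] - k) // stride + 1
    w_out = (image.shape[1] - k) // stride + 1
    out = np.empty((h_out, w_out, kernels.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            window = image[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Contract the (k, k, C_in) window against every output filter.
            out[i, j] = np.tensordot(window, kernels, axes=3)
    return np.maximum(out, 0.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N): every token attends to all tokens
    return weights @ v, weights

def init_params(image_size=32, in_channels=3, dim=16, num_classes=8):
    """Random, untrained weights; all sizes are hypothetical."""
    side = (image_size - KERNEL) // STRIDE + 1
    return {
        "conv": rng.normal(0, 0.1, (KERNEL, KERNEL, in_channels, dim)),
        "pos": rng.normal(0, 0.1, (side * side, dim)),  # positional embeddings, one per token
        "w_q": rng.normal(0, 0.1, (dim, dim)),
        "w_k": rng.normal(0, 0.1, (dim, dim)),
        "w_v": rng.normal(0, 0.1, (dim, dim)),
        "head": rng.normal(0, 0.1, (dim, num_classes)),
    }

def hybrid_forward(image, params):
    # CNN stage: each output location summarizes a small local neighborhood.
    fmap = conv2d_relu(image, params["conv"])
    # Treat every spatial location of the feature map as one token.
    tokens = fmap.reshape(-1, fmap.shape[-1]) + params["pos"]
    # Transformer stage: attention mixes information across all locations.
    attended, weights = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    tokens = tokens + attended  # residual connection
    # Mean-pool the tokens and classify.
    probs = softmax(tokens.mean(axis=0) @ params["head"])
    return probs, weights

params = init_params()
image = rng.random((32, 32, 3))  # stand-in for a preprocessed lesion image
probs, attn = hybrid_forward(image, params)
print(probs.shape, attn.shape)
```

With these assumed sizes, the convolution turns a 32x32 image into a 15x15 grid of 16-dimensional features, so the attention stage operates on 225 tokens and its weight matrix relates every image region to every other. A practical model would stack several convolutional and transformer blocks, use multiple attention heads with layer normalization, and learn the weights from labeled images.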