Skin Disease Classification Using Hybrid ResNet-50 based CNN preprocessing followed by ConvNeXtV2 and Vision Transformer Architecture
Contributors
Bipin P R
Sai Kiran Oruganti
Upendra Kumar
Track
Engineering, Sciences, Mathematics & Computations
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Skin diseases, particularly melanoma, pose a significant global health burden due to their increasing incidence and high mortality rates when diagnosis is delayed. Traditional diagnosis relies heavily on dermatological expertise and visual inspection, which may suffer from subjectivity and inter-observer variability. Recent advances in deep learning have enabled automated analysis of dermoscopic images, with Convolutional Neural Networks (CNNs) demonstrating strong performance in extracting local texture and color features. However, CNNs are limited in modeling long-range spatial dependencies. Vision Transformers (ViTs), which utilize self-attention mechanisms, address this limitation by capturing global contextual information, but often require large datasets and substantial computational resources.
This paper proposes a hybrid deep learning framework that combines CNN-based preprocessing and feature extraction with a Vision Transformer for global feature modeling. A ResNet-50 based CNN preprocessing stage followed by a ConvNeXtV2 architecture extracts discriminative local features, while a ViT-B/16 model captures long-range dependencies across image patches. Features from the two branches are fused through ensemble concatenation and passed to a classification head that predicts whether a skin lesion is benign or malignant. Experiments on the ISIC 2019 dataset show that the proposed hybrid model achieves higher accuracy, precision, recall, and F1-score than standalone CNN and transformer models. These results indicate that the hybrid architecture (ResNet-50 based CNN preprocessing, followed by ConvNeXtV2–ViT) offers a robust and reliable approach to automated skin disease diagnosis in clinical and telemedicine environments.
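The concatenation-based fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give feature dimensions, head structure, or training details, so the toy vector sizes, the single linear layer, and the function names `concat_fusion`, `linear_head`, and `softmax` are all assumptions made for illustration. Random vectors stand in for the pooled outputs of the CNN branch and the ViT branch.

```python
import math
import random


def concat_fusion(cnn_features, vit_features):
    """Fuse two per-image feature vectors by simple concatenation."""
    return list(cnn_features) + list(vit_features)


def linear_head(features, weights, bias):
    """Compute one logit per class; weights has shape [num_classes][dim]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# Toy sizes (assumed); real backbones produce much larger feature vectors.
CNN_DIM, VIT_DIM, NUM_CLASSES = 8, 6, 2

# Stand-ins for the pooled features of the CNN branch and the ViT branch.
cnn_features = [rng.gauss(0.0, 1.0) for _ in range(CNN_DIM)]
vit_features = [rng.gauss(0.0, 1.0) for _ in range(VIT_DIM)]

fused = concat_fusion(cnn_features, vit_features)

# Hypothetical, untrained classification head: benign vs. malignant.
weights = [[rng.gauss(0.0, 0.1) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = softmax(linear_head(fused, weights, bias))
print(len(fused), probs)
```

In a real pipeline the two input vectors would come from trained backbones and the head would be learned jointly or on frozen features. The sketch only shows that concatenation keeps both feature sets intact and lets the head weight them independently.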