DA-ACFNet: Adaptive Cross-Modal Transformer Fusion for Emotion Recognition in Neurodiverse Children
Contributors
Surendra Ramteke
Dr. Sunil Kumar
Keywords
Proceeding
Track
Engineering and Sciences
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Facial emotion recognition (FER) is an emerging technology in assistive healthcare, special education, and affective human-computer interaction. Conventional FER models may perform poorly on the facial expressions of neurodiverse children, whose expressions can be subtle, inconsistent, or different from those of neurotypical children. In this paper, we propose DA-ACFNet, an adaptive transformer-based fusion model for emotion recognition in neurodiverse children. The model has two streams, one for real facial representations and one for synthetic facial representations. A ResNeXt backbone extracts spatial features from real facial images, and transformer encoders process synthetic facial images created by augmentation. An adaptive cross-modal attention module learns how to integrate complementary emotional information from the two streams. Furthermore, a hybrid loss function that combines cross-entropy and contrastive learning is applied to enhance inter-class discrimination. Experimental results show that DA-ACFNet outperforms CNN, ResNet-50, EfficientNet-B0, and compact vision transformer baselines. The proposed model achieves 98.2% overall accuracy on the emotion recognition task, indicating its effectiveness for emotion recognition in children with different levels of neurodiversity.
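The hybrid loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following are assumptions made only for illustration: the contrastive term is taken to be a supervised contrastive loss over L2-normalized embeddings, the two terms are combined as a weighted sum, and the names `hybrid_loss`, `weight`, and `temperature` are hypothetical. The sketch uses only the Python standard library and is not the authors' implementation.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one sample, computed from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def _normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Assumed contrastive term: pull same-label embeddings together.

    Averaged over anchors that have at least one same-label partner.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # no same-class partner in the batch for this anchor
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0

def hybrid_loss(logits_batch, embeddings, labels, weight=0.5, temperature=0.1):
    """Assumed combination: mean cross-entropy + weight * contrastive term."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    con = supervised_contrastive(embeddings, labels, temperature)
    return ce + weight * con
```

In this sketch, `logits_batch` would come from the classifier head and `embeddings` from the fused feature vector; setting `weight=0.0` reduces the loss to plain cross-entropy, which makes it easy to check the contribution of the contrastive term.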