Attention-Based Acoustic Encoding: Transformer-Driven Longitudinal Vocal Biomarkers for Enhanced Depression Detection
Contributors
Prof. (Dr.) Dhananjay S. Deshpande
Shashi Kant Gupta
Sai Kiran Oruganti
Keywords
Proceeding
Track
Engineering, Sciences, Mathematics & Computations
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Spotting the signs of depression in a person's voice is notoriously difficult. The vocal clues are often subtle and vary greatly from person to person. While existing AI models have been applied to this task, they often miss the broader context in speech and the tiny yet critical shifts in tone and rhythm that can signal depression. To tackle this, we built a new model based on Transformer technology, which uses a "self-attention" mechanism. Think of it as teaching the model to focus more intently on the most telling parts of a voice recording, represented as a spectrogram-like visual map of sound, so it can pick up on patterns such as flat intonation, unusually long pauses, or energy changes. A key feature of our system is that it tracks these vocal patterns over time for an individual, making it more resilient to differences between speakers and to background noise. In tests on standard depression speech datasets, our approach was more accurate and sensitive than current methods. We believe this is a promising step toward practical tools that could help clinicians with early detection and ongoing monitoring, offering a scalable way to support people at risk for depression.
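To make the attention-based encoding idea concrete, the following is a minimal sketch (not the authors' implementation) of how a self-attention acoustic encoder of this kind is commonly assembled in PyTorch: mel-spectrogram frames are projected into a model dimension, a Transformer encoder attends across frames so the most telling moments (long pauses, flat contours, energy shifts) can be weighted more heavily, and a pooled summary feeds a binary depression classifier. All class names, layer sizes, and hyperparameters below are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch, assuming mel-spectrogram input of shape (batch, frames, n_mels).
# Hyperparameters (n_mels=64, d_model=128, 4 heads, 2 layers) are illustrative only.
from typing import Optional

import torch
import torch.nn as nn


class AcousticTransformerEncoder(nn.Module):
    def __init__(self, n_mels: int = 64, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 2, max_frames: int = 2000):
        super().__init__()
        # Project each spectrogram frame (n_mels bins) into the model dimension.
        self.input_proj = nn.Linear(n_mels, d_model)
        # Learned positional embeddings so attention can exploit frame order
        # (pauses and intonation contours are temporal patterns).
        self.pos_emb = nn.Embedding(max_frames, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Binary head: one logit per recording (depressed vs. not depressed).
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, mel: torch.Tensor,
                pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # mel: (batch, frames, n_mels); pad_mask: (batch, frames), True = padding.
        positions = torch.arange(mel.size(1), device=mel.device)
        x = self.input_proj(mel) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        if pad_mask is not None:
            # Mean-pool only over real (non-padded) frames.
            keep = (~pad_mask).unsqueeze(-1).float()
            pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        else:
            pooled = x.mean(dim=1)
        return self.classifier(pooled).squeeze(-1)


if __name__ == "__main__":
    model = AcousticTransformerEncoder()
    dummy = torch.randn(2, 500, 64)   # two recordings, 500 spectrogram frames each
    print(model(dummy).shape)         # torch.Size([2]) -> per-recording logits
```

A longitudinal variant of this idea, as described in the abstract, would additionally pool or attend over the per-recording embeddings collected from the same individual across multiple sessions, rather than classifying each recording in isolation.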