Designing Tiny Machine Learning Model for Keyword Spotting Using Knowledge Distillation – A comprehensive review
Contributors
Selvaperumal P
Keywords
Proceeding
Track
Engineering and Sciences
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
Small-footprint keyword spotting (KWS) on IoT or edge devices is the task of detecting predefined keywords in speech locally, on resource-constrained hardware, using lightweight machine learning models. Because these devices have highly constrained processing, memory, and power capacities, it is difficult to design a tiny machine learning model whose keyword spotting accuracy is similar to that of large models. This review systematically examines recent advances in tiny machine learning-based keyword spotting systems, covering the models used, the knowledge distillation processes employed, and the quantization processes applied for model compression. Most of the surveyed works use the Google Speech Commands dataset (versions v1 and v2) as the benchmark. They predominantly rely on hand-crafted acoustic features extracted from the raw audio waveform, the most common being Mel-Frequency Cepstral Coefficients (MFCC, typically 10–40 coefficients) and log-Mel filterbank energies (LFBE) or log-Mel spectrograms (e.g., 40–80 Mel bins). These features are the inputs to efficient neural networks, such as CNNs (e.g., DS-CNN, BC-ResNet), Transformers (e.g., DistilHuBERT and LightHuBERT), or hybrid designs, during both teacher training and student distillation. Knowledge distillation techniques (e.g., soft targets/logits, layer-wise hidden representations, contextualized latent transfer, or robust variants such as VIC-KD) are then used to compress the model. The surveyed approaches report substantial reductions in model size (often 29–75%) and in inference cost while preserving accuracy. The aim of this survey is to study current keyword spotting models and to identify research gaps in them.
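To make the soft-target (logit) form of knowledge distillation mentioned above concrete, the following is a minimal sketch of the commonly used formulation, in which a temperature-softened KL term between teacher and student outputs is blended with the usual cross-entropy on the true label. It is not taken from any specific surveyed work: the function names, the temperature of 4.0, and the weight alpha of 0.5 are illustrative assumptions, and individual papers may use different losses or weightings.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Soft-target distillation loss for a single utterance.

    loss = alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)

    temperature and alpha are illustrative defaults (assumptions), not values
    reported by the survey.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Standard cross-entropy on the hard label, at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Hypothetical logits for a three-keyword problem; the true keyword is class 0.
teacher = [6.0, 1.0, -2.0]
student = [2.5, 0.5, -1.0]
loss = distillation_loss(student, teacher, label=0)
```

In practice this quantity would be averaged over a batch and minimized with respect to the student's parameters only, with the teacher kept fixed.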