MicroLLM: Ultra-Compressed Language Model Deployment on Microcontrollers using Structured Sparsity and 2-bit Quantization
Contributors
Vugar Abdullayev
Proceeding
Track
Engineering and Sciences
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
The proliferation of edge computing and Internet of Things (IoT) devices has created an urgent demand for deploying large language models (LLMs) on resource-constrained microcontrollers (MCUs) with limited flash memory, SRAM, and computational throughput. This paper presents MicroLLM, a novel framework that integrates structured sparsity and 2-bit quantization to enable ultra-compressed language model deployment on ARM Cortex-M class microcontrollers. The proposed methodology combines magnitude-based structured pruning at the attention-head and feed-forward-neuron level with a custom non-uniform 2-bit quantization scheme that preserves critical weight distributions while sharply reducing the memory footprint. Experiments on the STM32H7 and nRF52840 platforms show that MicroLLM reduces model size by 94.2% relative to the original FP32 baseline, with only a 4.8% degradation in perplexity on the WikiText-2 benchmark. In addition, hardware-aware kernel optimizations reduce inference latency by 67% compared to naively quantized baselines. MicroLLM opens pathways for deploying conversational AI, keyword spotting, and on-device NLP directly on MCUs without cloud dependency, enabling privacy-preserving, real-time edge intelligence.
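The abstract names two compression steps without showing their mechanics, so the following is a minimal illustrative sketch rather than the MicroLLM implementation. It assumes an L1-norm criterion for neuron-level magnitude pruning and a hypothetical four-entry codebook for the non-uniform 2-bit quantizer; the paper's actual pruning criterion, codebook values, and bit layout are not given here. All function names are invented for illustration.

```python
# Hypothetical non-uniform codebook: 2 bits select one of four levels.
# Real levels would be fitted to the weight distribution; these are assumed.
CODEBOOK = (-1.0, -0.25, 0.25, 1.0)


def prune_neurons(rows, keep):
    """Structured pruning sketch: keep the `keep` rows (neurons) with the
    largest L1 norm and drop the rest as whole units."""
    norms = [sum(abs(w) for w in row) for row in rows]
    ranked = sorted(range(len(rows)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original neuron order
    return [rows[i] for i in kept], kept


def quantize(weights, codebook=CODEBOOK):
    """Map each weight to the index (0..3) of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))
        for w in weights
    ]


def pack_2bit(codes):
    """Pack four 2-bit codes per byte, first code in the lowest bits."""
    out = bytearray((len(codes) + 3) // 4)
    for n, code in enumerate(codes):
        out[n // 4] |= (code & 0b11) << (2 * (n % 4))
    return bytes(out)


def unpack_2bit(packed, count):
    """Recover `count` 2-bit codes from the packed byte string."""
    return [(packed[n // 4] >> (2 * (n % 4))) & 0b11 for n in range(count)]


def dequantize(codes, codebook=CODEBOOK):
    """Look each code up in the codebook to get approximate weights."""
    return [codebook[c] for c in codes]


if __name__ == "__main__":
    layer = [[0.1, -0.1], [2.0, 1.0], [0.5, -0.5]]
    pruned, kept = prune_neurons(layer, keep=2)
    flat = [w for row in pruned for w in row]
    packed = pack_2bit(quantize(flat))
    print(kept, len(flat) * 4, "bytes as FP32 ->", len(packed), "bytes packed")
```

Packing four weights per byte makes the stored weights 16 times smaller than FP32 before any pruning, which is where most of the footprint reduction in a 2-bit scheme comes from; a real deployment would also store the codebook and any per-group scale factors.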