MicroLLM: Ultra-Compressed Language Model Deployment on Microcontrollers using Structured Sparsity and 2-bit Quantization

Vugar Abdullayev

Date Published: 1 June 2026

Contributors

Vugar Abdullayev (Author)
Azerbaijan State Oil and Industry University

Keywords

Microcontroller, Language model compression, 2-bit quantization, Edge AI, TinyML, ARM Cortex-M

Proceeding

Vol. 1 No. 5 (2026): LGPR Batch 4 Conference 1

Track

Engineering and Sciences

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract

The proliferation of edge computing and Internet of Things (IoT) devices has created strong demand for running large language models (LLMs) on resource-constrained microcontrollers (MCUs), which offer limited flash memory, SRAM, and computational throughput. This paper presents MicroLLM, a framework that combines structured sparsity and 2-bit quantization to deploy ultra-compressed language models on ARM Cortex-M class microcontrollers. The proposed method pairs magnitude-based structured pruning at the level of attention heads and feed-forward neurons with a custom non-uniform 2-bit quantization scheme that preserves critical features of the weight distribution while substantially reducing the memory footprint. Experiments on STM32H7 and nRF52840 platforms show that MicroLLM reduces model size by 94.2% relative to the original FP32 baseline, with only a 4.8% degradation in perplexity on the WikiText-2 benchmark. In addition, hardware-aware kernel optimizations reduce inference latency by 67% compared with naively quantized baselines. MicroLLM thus opens a path to running conversational AI, keyword spotting, and other on-device NLP tasks directly on MCUs without cloud dependency, enabling privacy-preserving, real-time edge intelligence.
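The abstract names magnitude-based structured pruning of attention heads and feed-forward neurons but does not state the scoring rule. The Python sketch below assumes one plausible rule for a feed-forward layer: each hidden neuron is scored by the L2 norm of its input row together with its output column, and the lowest-scoring neurons are removed outright. The names `neuron_scores` and `prune_ffn` and the `keep_ratio` parameter are hypothetical, chosen for illustration only.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def neuron_scores(w_in: Matrix, w_out: Matrix) -> List[float]:
    """Assumed importance of hidden neuron j: L2 norm of w_in[j] plus column j of w_out."""
    scores = []
    for j, row in enumerate(w_in):
        sq = sum(v * v for v in row)            # weights feeding neuron j
        sq += sum(r[j] * r[j] for r in w_out)   # weights reading neuron j
        scores.append(math.sqrt(sq))
    return scores


def prune_ffn(w_in: Matrix, w_out: Matrix, keep_ratio: float) -> Tuple[Matrix, Matrix, List[int]]:
    """Remove the lowest-scoring hidden neurons, keeping max(1, round(keep_ratio * hidden)).

    w_in has shape hidden x d_model and w_out has shape d_model x hidden.
    Returns the pruned (still dense) matrices and the indices of the kept neurons.
    """
    scores = neuron_scores(w_in, w_out)
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original neuron order
    new_in = [list(w_in[j]) for j in kept]
    new_out = [[r[j] for j in kept] for r in w_out]
    return new_in, new_out, kept
```

Because whole neurons are removed, the result is a smaller dense layer with no sparse-index overhead, which keeps the inference kernel simple on an MCU. Attention heads could be scored the same way by summing over each head's block of projection rows.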
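The abstract does not specify the exact non-uniform 2-bit quantizer. The sketch below assumes a per-row codebook of four values fitted with a few Lloyd (1-D k-means) iterations, so levels cluster where weights are dense, with the 2-bit indices packed four per byte, lowest bits first. The names `fit_codebook`, `quantize_row`, and `dequantize_row` are hypothetical.

```python
from typing import List, Tuple

LEVELS = 4  # 2 bits -> 4 representable values


def nearest(cb: List[float], v: float) -> int:
    """Index of the codebook entry closest to v."""
    return min(range(len(cb)), key=lambda k: abs(cb[k] - v))


def fit_codebook(row: List[float], iters: int = 20) -> List[float]:
    """Fit 4 non-uniform levels to a non-empty row with Lloyd (1-D k-means) iterations."""
    s = sorted(row)
    # Start at the 1/8, 3/8, 5/8 and 7/8 quantiles so dense regions get levels.
    cb = [s[(2 * k + 1) * len(s) // (2 * LEVELS)] for k in range(LEVELS)]
    for _ in range(iters):
        sums, counts = [0.0] * LEVELS, [0] * LEVELS
        for v in row:
            k = nearest(cb, v)
            sums[k] += v
            counts[k] += 1
        # Move each level to the mean of its members; empty levels stay put.
        cb = [sums[k] / counts[k] if counts[k] else cb[k] for k in range(LEVELS)]
    return sorted(cb)


def quantize_row(row: List[float]) -> Tuple[List[float], bytes]:
    """Return (codebook, packed 2-bit indices), four indices per byte, lowest bits first."""
    cb = fit_codebook(row)
    codes = [nearest(cb, v) for v in row]
    codes += [0] * (-len(codes) % 4)  # pad to a multiple of 4
    packed = bytearray()
    for i in range(0, len(codes), 4):
        packed.append(codes[i] | codes[i + 1] << 2 | codes[i + 2] << 4 | codes[i + 3] << 6)
    return cb, bytes(packed)


def dequantize_row(cb: List[float], packed: bytes, n: int) -> List[float]:
    """Expand the first n packed indices back to floats, e.g. for accuracy checks on a host."""
    return [cb[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]
```

Storing 2-bit codes instead of 32-bit floats shrinks weight storage 16x (a 93.75% saving) before the per-row codebook overhead is added back; pruning then removes whole rows and columns on top of that.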
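On the inference side, a four-entry codebook allows a kernel to skip per-weight dequantization entirely. The following is an illustrative assumption, not a description of the kernels evaluated in the paper: activations that share a 2-bit code are summed first, and each of the four sums is then multiplied by its codebook value, so a row needs only four multiplies. A 256-entry lookup table decodes a packed byte (four codes, lowest bits first) without per-code shifting. The Python below only models the arithmetic; `packed_dot` and `DECODE_LUT` are hypothetical names, and a deployed Cortex-M version would typically be written in C.

```python
from typing import Sequence

# Precomputed once: byte value -> its four 2-bit codes, lowest bits first.
# 256 small entries, small enough to live in flash on an MCU.
DECODE_LUT = [tuple((b >> (2 * i)) & 0b11 for i in range(4)) for b in range(256)]


def packed_dot(codebook: Sequence[float], packed: bytes, x: Sequence[float]) -> float:
    """Dot product of x with a row stored as packed 2-bit codes into a 4-entry codebook."""
    bins = [0.0, 0.0, 0.0, 0.0]  # running sum of activations per code
    n = len(x)
    for byte_idx, b in enumerate(packed):
        base = 4 * byte_idx
        for i, code in enumerate(DECODE_LUT[b]):
            if base + i < n:  # ignore padding codes past the end of x
                bins[code] += x[base + i]
    # Only four multiplies per row, one per codebook level.
    return sum(c * s for c, s in zip(codebook, bins))
```

The trade-off is one addition per weight plus four multiplies per row, instead of one multiply-accumulate per weight after dequantization.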

How to Cite

Abdullayev, V. (2026). MicroLLM: Ultra-Compressed Language Model Deployment on Microcontrollers using Structured Sparsity and 2-bit Quantization. Sustainable Global Societies Initiative, 1(5). https://vectmag.com/sgsi/paper/view/630