Structured Pruning of LLMs for Resource-Constrained Devices: a Survey

Title: Structured Pruning of LLMs for Resource-Constrained Devices: a Survey

Tags: Large language models (LLMs), Model compression, Resource-constrained devices, Structured pruning

Abstract:

The size of large language models (LLMs) has grown massively, which makes them harder to run in resource-constrained environments. To address this issue, many optimization techniques have emerged, with pruning being one of them. Structured pruning, a branch of pruning, compresses LLMs by removing entire groups of parameters, followed by further fine-tuning or pretraining to recover performance when required. To the best of our knowledge, this is the first in-depth survey focused solely on the structured pruning of large language models. Twenty-five papers in this category were examined, of which 18 represent widely recognized methods and 7 are recent approaches. To organize these methods, a four-dimensional taxonomy based on pruning granularity, importance estimation, recovery strategy, and pruning schedule is introduced. The analysis shows a clear trend in each of the four dimensions: component-level pruning, gradient-based importance scores, parameter-efficient tuning, and one-shot pruning dominate, respectively. Finally, emerging trends, challenges, and future directions are discussed. This survey aims to guide researchers and engineers seeking efficient ways to bring large language models to resource-constrained environments.