Title: Morpheus: Accelerating Large Language Models with Feature-Augmented Autoregressive Drafting

Conference: PRICAI 2025

Tags: Inference Acceleration, Large Language Models, Speculative Decoding

Abstract: Autoregressive decoding makes inference for Large Language Models (LLMs) both memory-bandwidth-bound and time-consuming. In this paper, we reconsider the draft head paradigm in speculative decoding and derive two key observations. First, existing draft heads are sequentially independent, speculating on draft tokens without considering their preceding context within the continuation. Second, highly ambiguous tokens disproportionately corrupt the effective length of draft sequences generated by draft heads. Based on these insights, we propose Morpheus, a draft head that generates draft tokens sequentially in an autoregressive manner. By integrating features from the target model and from the draft head itself at the previous time step, Morpheus effectively extends the average acceptance length, thereby increasing the end-to-end decoding rate. We conducted comprehensive evaluations of Morpheus on both code generation and text generation tasks. For Vicuna 7B, Morpheus improves decoding speed by 1.15x and 2.5x compared to Medusa decoding and autoregressive decoding, respectively.
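The drafting loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's architecture: the single fusion layer, the dimensions, and the random weights are all assumptions chosen for readability. The point it demonstrates is the difference from sequentially independent heads (as in Medusa): each draft token here conditions on its predecessors, because the draft head feeds its own previous-step feature and the previously drafted token back into the next step, alongside the target model's hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D = 32, 16  # toy vocabulary size and feature dimension (assumptions)

# Hypothetical weights for a one-layer draft head (illustration only).
W_in = rng.standard_normal((3 * D, D)) * 0.1   # fuses the three input features
W_out = rng.standard_normal((D, VOCAB)) * 0.1  # projects to vocabulary logits
embed = rng.standard_normal((VOCAB, D)) * 0.1  # token embedding table

def draft_step(target_feat, prev_draft_feat, prev_token):
    """One autoregressive draft step: fuse the target model's hidden state
    with the draft head's own feature from the previous time step and the
    embedding of the previously drafted token."""
    x = np.concatenate([target_feat, prev_draft_feat, embed[prev_token]])
    h = np.tanh(x @ W_in)          # new draft-head feature, fed back next step
    logits = h @ W_out
    return h, int(np.argmax(logits))  # greedy drafting for simplicity

def draft_sequence(target_feat, first_token, steps=4):
    """Generate a chain of draft tokens; unlike independent heads, each
    token sees its preceding context through the fed-back feature."""
    feat, tok = np.zeros(D), first_token
    tokens = []
    for _ in range(steps):
        feat, tok = draft_step(target_feat, feat, tok)
        tokens.append(tok)
    return tokens
```

In a real system the target model would then verify the drafted chain in one forward pass, accepting the longest matching prefix; longer accepted prefixes are what translate into the reported end-to-end speedup.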
