Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models (LLMs). Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges: the increasing memory demands of the draft model, the distribution shift between short training corpora and long-context inference, and inefficiencies in attention implementation.
In this work, we enhance the performance of speculative decoding in long-context settings by addressing these challenges. First, we propose a memory-efficient draft model with a constant-sized Key-Value (KV) cache. Second, we introduce novel position indices for short-training data, enabling seamless adaptation from short-context training to long-context inference. Finally, we present an innovative attention aggregation method that combines fast implementations for prefix computation with standard attention for tree mask handling, effectively resolving the latency and memory inefficiencies of tree decoding.
Our approach achieves strong results on various long-context tasks, including repository-level code completion, long-context summarization, and o1-like long reasoning tasks, demonstrating substantial reductions in decoding latency.
While speculative decoding has shown promise in reducing inference latency, existing research primarily focuses on short-context scenarios. The true potential of speculative decoding lies in long-context settings, where autoregressive decoding struggles to utilize GPU resources efficiently. However, draft models designed for short-context tasks fail to generalize well to long sequences due to fundamental limitations. We identify three key challenges that hinder their effectiveness in long-context speculative decoding.
As the sequence length increases, existing draft models, such as EAGLE and GliDe, require a linearly growing Key-Value (KV) cache, leading to excessive memory consumption. This problem is particularly severe in long-context inference, where efficient memory usage is crucial.
Training data for speculative decoding primarily consists of short-context samples, which causes a mismatch when applied to long-context inference. The draft model, having been trained mostly on small position indices, struggles to speculate effectively on large indices.
Tree-based speculative decoding, while effective, faces computational bottlenecks due to incompatibilities with optimized attention kernels. Traditional implementations suffer from inefficient memory access patterns, increasing latency.
To overcome the challenges of long-context speculative decoding, we introduce LongSpec, a novel framework designed to improve efficiency and scalability. Our methodology consists of three key innovations that address memory overhead, position index distribution shift, and tree attention inefficiencies.
To address the memory overhead problem, we introduce a memory-efficient draft model with constant memory usage, regardless of context length. Our architecture consists of a self-attention module with sliding-window attention to constrain memory usage and a cross-attention module that reuses the KV cache of the target model. Furthermore, we share the Embedding Layer and LM Head between the draft and target models, significantly reducing memory demands for large-vocabulary LLMs.
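Below is a minimal PyTorch sketch of this draft-layer idea, not the authors' implementation: the module names, dimensions, window size, and the simple additive combination of the two attention paths are all illustrative choices. It shows sliding-window self-attention (so the draft's own KV cache stays constant-sized during decoding) together with cross-attention that reads the target model's existing KV cache; the embedding layer and LM head would simply be taken from the target model.

```python
# Illustrative sketch of a memory-efficient draft layer: sliding-window self-attention
# plus cross-attention over the target model's KV cache. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DraftLayer(nn.Module):
    def __init__(self, d_model: int = 4096, n_heads: int = 32, window: int = 512):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.q_self = nn.Linear(d_model, d_model)
        self.kv_self = nn.Linear(d_model, 2 * d_model)
        self.q_cross = nn.Linear(d_model, d_model)   # keys/values come from the target's cache
        self.out = nn.Linear(d_model, d_model)

    def _heads(self, x):                             # (B, T, D) -> (B, H, T, Dh)
        B, T, _ = x.shape
        return x.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, h, target_k, target_v):
        B, T, D = h.shape
        # Sliding-window self-attention: each token attends to at most `window` recent
        # tokens, so only the last `window` draft K/V pairs ever need to be cached.
        q = self._heads(self.q_self(h))
        kv = self.kv_self(h).chunk(2, dim=-1)
        k, v = self._heads(kv[0]), self._heads(kv[1])
        pos = torch.arange(T, device=h.device)
        band = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < self.window)
        h_self = F.scaled_dot_product_attention(q, k, v, attn_mask=band)
        # Cross-attention over the target model's cached keys/values (B, H, S, Dh):
        # the draft adds no KV memory of its own for the long prefix.
        h_cross = F.scaled_dot_product_attention(self._heads(self.q_cross(h)), target_k, target_v)
        merged = (h_self + h_cross).transpose(1, 2).reshape(B, T, D)
        return h + self.out(merged)
# The embedding layer and LM head are assumed to be shared with the target model.
```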
To solve the position index distribution shift problem, we propose Anchor-Offset Indices. This strategy ensures balanced training across all positions by assigning large, random offsets after a fixed set of attention sink tokens, preventing over-reliance on small indices while maintaining compatibility with the target model; a small sketch follows below.
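As a concrete illustration, here is a short sketch of how such position indices could be constructed for a short training sample; the anchor count and maximum position below are hypothetical values, not constants from the paper. A few attention-sink tokens keep the smallest indices, and the remaining tokens share one large random offset so that large position indices are also exercised during short-context training.

```python
# Illustrative construction of Anchor-Offset Indices for one training sample.
import torch

def anchor_offset_indices(seq_len: int, n_anchors: int = 4, max_pos: int = 131072):
    # Attention-sink anchors keep their original small indices: 0, 1, ..., n_anchors - 1.
    anchors = torch.arange(n_anchors)
    # The remaining tokens are shifted by one random offset, sampled so indices stay < max_pos.
    n_rest = seq_len - n_anchors
    offset = torch.randint(n_anchors, max_pos - n_rest + 1, (1,)).item()
    rest = offset + torch.arange(n_rest)
    return torch.cat([anchors, rest])

pos = anchor_offset_indices(1024)
print(pos[:6], pos[-1].item())   # anchors keep small indices; the tail lands at a large index
```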
To mitigate the tree attention inefficiencies, i.e., the incompatibility between Tree Speculative Decoding and optimized attention kernels such as Flash_Decoding, we introduce Hybrid Tree Attention, a method that partitions the key-value pairs into two groups: cached tokens, which require no masking, and speculative tokens, which require masked attention. The two partial outputs are then merged using a log-sum-exp trick whose correctness is theoretically guaranteed.
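A minimal sketch of this merging step is given below; the function name and tensor shapes are illustrative. Given the partial attention output and the log-sum-exp of the raw scores computed over the cached prefix (e.g., by Flash_Decoding), and the same quantities from the masked attention over the speculative tree tokens, the full-attention output is recovered exactly as a weighted combination of the two partial outputs.

```python
# Log-sum-exp merge of two partial attention results over disjoint key sets.
# o_*: partial outputs of shape (B, H, T, Dh); lse_*: log-sum-exp of the raw scores, shape (B, H, T).
import torch

def merge_attention(o_cache, lse_cache, o_spec, lse_spec):
    m = torch.maximum(lse_cache, lse_spec)            # shared max for numerical stability
    w_cache = torch.exp(lse_cache - m).unsqueeze(-1)  # un-normalized softmax mass of the prefix part
    w_spec = torch.exp(lse_spec - m).unsqueeze(-1)    # un-normalized softmax mass of the tree part
    return (w_cache * o_cache + w_spec * o_spec) / (w_cache + w_spec)
```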
The table and the figure below show the decoding speeds and mean accepted lengths across the five evaluated datasets at T=0 and T=1, respectively. Our proposed method significantly outperforms all other approaches on both summarization and code completion tasks. At T=0, it reaches a mean accepted length of around 3.5 with a speedup of up to 2.67x on summarization, and a mean accepted length of around 4 with a speedup of up to 3.26x on code completion. This highlights the robustness and generalizability of our speculative decoding approach, particularly in long-text generation tasks. At T=1, our method still achieves around a 2.5x speedup, maintaining a substantial lead over MagicDec, which indicates that our approach is robust across different temperature settings and further validates its soundness and efficiency.
Long Chain-of-Thought (LongCoT) tasks have gained significant attention recently because they enable models to perform complex reasoning and problem solving over extended outputs. In these tasks, the prefix input is often relatively short, but the generated output can be extremely long, posing unique challenges for efficiency and token acceptance. Our method is particularly well-suited to such long-output scenarios. It is worth noting that MagicDec is not suitable here, because the initial inference stage of a LongCoT task differs from that of a traditional long-context task: with a short prefix, the draft model in MagicDec completely degenerates into the target model and fails to achieve any acceleration.
We evaluate our method on the QwQ-32B model using the widely-used benchmark AIME24 dataset, with a maximum output length set to 32k tokens.
The results, illustrated below, demonstrate a significant improvement in both generation speed and mean accepted tokens.
Specifically, our method achieves a generation rate of 42.63 tokens/s, 2.25x higher than the baseline's 18.92 tokens/s, with a mean accepted length of 3.82 tokens. Notably, QwQ-32B with LongSpec achieves even lower latency than the vanilla 7B model with Flash_Decoding, demonstrating that our method effectively accelerates the LongCoT model.
These findings not only highlight the effectiveness of our method in the LongCoT task but also provide new insights into lossless inference acceleration for the o1-like model.
We believe that speculative decoding will play a crucial role in accelerating this type of model in the future.
The figure below shows that draft models pretrained with Anchor-Offset Indices achieve both a lower initial loss and a lower final loss than those pretrained without them when subsequently trained on a real long-context dataset. Notably, the initialization with Anchor-Offset Indices reaches the same loss level $3.93\times$ faster than its counterpart.
The results presented below highlight the effectiveness of the proposed Hybrid Tree Attention, which combines Flash_Decoding with the Triton kernel fused_mask_attn.
While the time spent on the draft model forward pass and the target model FFN computations remain comparable across the two methods, the hybrid approach exhibits a significant reduction in latency for the target model's attention layer (the yellow part).
Specifically, the attention computation latency decreases from 49.92 ms in the HF implementation to 12.54 ms in the hybrid approach, an improvement of approximately 75%, owing largely to the cached prefix being handled by Flash_Decoding.
Extensive experiments demonstrate the effectiveness of LongSpec in long-context understanding tasks and real-world long reasoning tasks. Our findings highlight the importance of designing speculative decoding methods specifically tailored for long-context settings and pave the way for future research in efficient large-scale language model inference.
@article{yang2025longspec,
author = {Penghui Yang and Cunxiao Du and Fengzhuo Zhang and Haonan Wang and Tianyu Pang and Chao Du and Bo An},
title = {LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification},
journal = {arXiv},
year = {2025},
}