Every token an LLM generates requires a full forward pass through the entire model. A 70-billion parameter model. For one token. Then it does it again. And again. Hundreds or thousands of times, sequentially, to produce a single response.
In the KV caching post, we fixed the quadratic blowup of recomputing keys and values. That was a necessary optimization. But even with a perfectly cached model, the fundamental bottleneck remains: autoregressive generation is sequential. The model cannot start producing token $t+1$ until token $t$ exists, because token $t+1$'s probability distribution depends on the entire preceding context, including token $t$.
This post is about a technique that breaks that bottleneck without changing the model, without retraining, and without any approximation. The output is mathematically identical to standard decoding. The technique is called speculative decoding (Leviathan et al., 2023; Chen et al., 2023), and it is one of the most elegant ideas in modern LLM inference.
The Bottleneck: Why Inference is Memory-Bound
Let's revisit why generating tokens is slow. During the decode phase, at each step, the model:
- Takes one token embedding as input
- Passes it through every layer (projections, attention, FFN)
- Produces logits over the vocabulary
- Samples or selects one token
- Repeats
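In code, the loop looks something like this -- a minimal sketch with no KV cache, where `model` is assumed to be any callable mapping token ids `[1, seq_len]` to logits `[1, seq_len, vocab_size]`:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32):
    """Standard autoregressive decoding: one full forward pass per token."""
    ids = input_ids                        # [1, seq_len]
    for _ in range(max_new_tokens):
        logits = model(ids)                # full forward pass: loads ALL weights
        next_tok = logits[0, -1].argmax()  # greedy pick from the vocab logits
        ids = torch.cat([ids, next_tok.view(1, 1)], dim=1)
    return ids
```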
For a model like Llama 2 70B, each forward pass involves loading roughly 140 GB of model weights from GPU memory (HBM) into the compute units. But the actual arithmetic -- a matrix-vector multiply for each weight matrix -- is tiny. The compute-to-memory ratio is abysmal.
This is what it means to be memory-bound: the GPU spends most of its time waiting for data to arrive from memory, not performing calculations. The arithmetic intensity (FLOPs per byte loaded) during single-token decode is far below what the hardware can handle.
Here are the numbers for Llama 2 70B on an A100-80GB:
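A back-of-envelope sketch, using round published A100 specs (~2 TB/s HBM bandwidth, ~312 fp16 TFLOPS) and fp16 weights; the constants are approximations:

```python
# Why single-token decode is memory-bound: rough A100-80GB numbers.
HBM_BANDWIDTH = 2.0e12      # ~2 TB/s
PEAK_FP16_FLOPS = 312e12    # ~312 TFLOPS

weight_bytes = 70e9 * 2     # 70B params in fp16 ~= 140 GB
flops_per_token = 2 * 70e9  # ~2 FLOPs per parameter per token

memory_time = weight_bytes / HBM_BANDWIDTH        # time just to stream the weights
compute_time = flops_per_token / PEAK_FP16_FLOPS  # time for the actual math

print(f"memory-bound:  {memory_time * 1e3:.1f} ms/token")    # ~70 ms
print(f"compute-bound: {compute_time * 1e3:.2f} ms/token")   # ~0.45 ms
print(f"compute utilization: {compute_time / memory_time:.1%}")  # ~0.6%
```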
The GPU's compute units are utilized at less than 1% during autoregressive generation. This is why verifying 100 tokens in a single forward pass takes roughly the same wall-clock time as generating one token: the pass is dominated by weight loading, and you load the same weights regardless of how many tokens you process.
This is the key insight that makes speculative decoding possible:
Verification is (nearly) free. Processing $K$ tokens in one forward pass costs almost the same as processing 1 token, because the bottleneck is loading weights, not computing with them.
The Core Idea: Draft Then Verify
Speculative decoding exploits this asymmetry with a beautifully simple two-phase approach:
Phase 1 -- Draft: A small, fast "draft" model generates $K$ candidate tokens autoregressively. Because it's small (say, 1B parameters vs 70B), it runs 10-50x faster per token.
Phase 2 -- Verify: The large "target" model processes all candidate tokens in a single forward pass. It computes the target probability for each token position and decides whether to accept or reject each draft.
If all $K$ tokens are accepted, we've generated $K$ tokens for the cost of $K$ cheap draft steps plus one expensive target step -- instead of $K$ expensive target steps. If some are rejected, we still make progress: we keep all tokens up to the first rejection and sample a correction from the target model.
Let's trace through a concrete example. Suppose we're generating the sentence "The quick brown fox jumps over" with $K = 4$:
Standard decoding (4 target forward passes):
- Target model: context = "The" → generates "quick" (40ms)
- Target model: context = "The quick" → generates "brown" (40ms)
- Target model: context = "The quick brown" → generates "fox" (40ms)
- Target model: context = "The quick brown fox" → generates "jumps" (40ms)
- Total: 160ms for 4 tokens
Speculative decoding (K=4 draft steps + 1 target verification):
- Draft model generates 4 candidates: "quick", "brown", "fox", "jumped" (4 × 2ms = 8ms)
- Target model verifies all 4 in one pass (40ms)
- First 3 accepted, "jumped" rejected → sample correction "jumps"
- Total: 48ms for 4 tokens (3.3x speedup!)
The draft model got 3 out of 4 right. We accepted those, rejected the wrong one, and sampled the correct token from the target model's recovery distribution. We got 4 tokens for the price of 48ms instead of 160ms.
The Math: Modified Rejection Sampling
The magic of speculative decoding is not just that it's fast -- it's that the output distribution is exactly identical to standard decoding from the target model. Not approximately. Exactly. Let's see why.
Setup
Let $p(x)$ be the target model's distribution at position $t$, and $q(x)$ be the draft model's distribution at the same position. For brevity, I'll write $p = p(x)$ and $q = q(x)$.
The draft model samples a token $x \sim q$. We want to decide whether to accept this sample as if it came from $p$.
The Acceptance Rule
For each draft token $x$, compute the acceptance probability:

$$a(x) = \min\left(1, \frac{p(x)}{q(x)}\right)$$
This means:
- If $q(x) \le p(x)$: Always accept. The draft model assigned no more probability to this token than the target would, so the draft is "under-confident" here. Accepting always is fine.
- If $q(x) > p(x)$: Accept with probability $\frac{p(x)}{q(x)}$. The draft model is "over-confident" about this token relative to the target. We randomly reject it to correct the bias.
This is a form of rejection sampling, a classic technique in statistics for sampling from a target distribution using proposals from an easier distribution.
What Happens on Rejection?
When we reject a draft token at position $i$, we discard all remaining drafts after the rejection point and sample a single token from the recovery distribution:

$$p_{\text{recovery}}(x) = \frac{\max(0,\; p(x) - q(x))}{\sum_{x'} \max(0,\; p(x') - q(x'))}$$
This is the normalized "residual" distribution: it contains exactly the probability mass that the draft model under-represented. Tokens where $p(x) > q(x)$ get positive mass; tokens where $p(x) \le q(x)$ get zero mass.
The recovery distribution ensures we fill in exactly the probability that rejection sampling "missed." Together, the accept + recovery process produces samples from the exact target distribution $p$.
Working Through an Example
Let's say the vocabulary is {A, B, C} and the distributions at some position are:
| Token | $p$ (target) | $q$ (draft) | Accept prob $\min(1, p/q)$ | $\max(0, p - q)$ |
|---|---|---|---|---|
| A | 0.5 | 0.3 | 1.0 | 0.2 |
| B | 0.3 | 0.5 | 0.6 | 0.0 |
| C | 0.2 | 0.2 | 1.0 | 0.0 |
Suppose the draft samples B ($x = B$, drawn from $q$). The acceptance probability is $\min(1, p(B)/q(B)) = \min(1, 0.3/0.5) = 0.6$.
- With probability 0.6: Accept B.
- With probability 0.4: Reject B and sample from $p_{\text{recovery}}$.
The recovery distribution is:

$$p_{\text{recovery}}(A) = \frac{0.2}{0.2} = 1.0, \quad p_{\text{recovery}}(B) = 0, \quad p_{\text{recovery}}(C) = 0$$
If we reject B, we always sample A. This makes sense: the draft over-represents B and under-represents A, so when we reject, we correct by sampling A.
Let's verify the overall probability of producing each token. For a single position, the probability of outputting token $x$ is:

$$P(\text{output} = x) = q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) + P(\text{reject}) \cdot p_{\text{recovery}}(x)$$
For token A:
- Accept A directly: $q(A) \cdot 1.0 = 0.3$
- Reject B, sample A from $p_{\text{recovery}}$: $q(B) \cdot 0.4 \cdot 1.0 = 0.5 \times 0.4 = 0.2$
- Reject C, sample A from $p_{\text{recovery}}$: contributes $0$ (C is always accepted)
- Total: $0.3 + 0.2 + 0 = 0.5 = p(A)$ ✓

For token B:
- Accept B directly: $q(B) \cdot 0.6 = 0.5 \times 0.6 = 0.3$
- Sample B from $p_{\text{recovery}}$: $p_{\text{recovery}}(B) = 0$, so no contribution
- Total: $0.3 = p(B)$ ✓

For token C:
- Accept C directly: $q(C) \cdot 1.0 = 0.2$
- Sample C from $p_{\text{recovery}}$: $p_{\text{recovery}}(C) = 0$, so no contribution
- Total: $0.2 = p(C)$ ✓
The output distribution matches the target exactly. This is not a coincidence -- it's a mathematical guarantee.
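You can confirm the arithmetic with a few lines of Python that just re-implement the accept/reject/recovery algebra for these toy distributions:

```python
p = {"A": 0.5, "B": 0.3, "C": 0.2}  # target distribution
q = {"A": 0.3, "B": 0.5, "C": 0.2}  # draft distribution

# Recovery distribution: normalized residual max(0, p - q)
residual = {t: max(0.0, p[t] - q[t]) for t in p}
z = sum(residual.values())
recovery = {t: r / z for t, r in residual.items()}

# Total rejection probability, summed over whatever the draft might propose
reject = sum(q[t] * (1 - min(1.0, p[t] / q[t])) for t in q)

output = {t: q[t] * min(1.0, p[t] / q[t]) + reject * recovery[t] for t in p}
print(output)  # {'A': 0.5, 'B': 0.3, 'C': 0.2} -- exactly the target p
```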
Why It's Lossless: The Proof
Let's prove this in general. For any token $x$ in the vocabulary:

$$P(\text{output} = x) = \underbrace{q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)}_{\text{accepted directly}} + \underbrace{P(\text{reject}) \cdot p_{\text{recovery}}(x)}_{\text{sampled on recovery}}$$

Case 1: $q(x) \ge p(x)$ (draft is over-confident)

The first term gives $q(x) \cdot \frac{p(x)}{q(x)} = p(x)$.

Since $p(x) - q(x) \le 0$, we have $p_{\text{recovery}}(x) = 0$, so the second term vanishes.

So $P(\text{output} = x) = p(x)$. ✓

Case 2: $q(x) < p(x)$ (draft is under-confident)

The first term gives $q(x) \cdot 1 = q(x)$.

For the second term, we need $P(\text{reject}) \cdot p_{\text{recovery}}(x)$, where $p_{\text{recovery}}(x) = \frac{p(x) - q(x)}{\sum_{x'} \max(0,\; p(x') - q(x'))}$.

The total rejection probability is:

$$P(\text{reject}) = \sum_{x'} q(x')\left(1 - \min\!\left(1, \frac{p(x')}{q(x')}\right)\right) = \sum_{x'} \max(0,\; q(x') - p(x'))$$

Now, since $\sum_{x'} p(x') = \sum_{x'} q(x') = 1$:

$$\sum_{x'} \max(0,\; q(x') - p(x')) = \sum_{x'} \max(0,\; p(x') - q(x'))$$

(The total surplus where $q > p$ must equal the total deficit where $q < p$, because both distributions sum to 1.)

So the second term becomes $P(\text{reject}) \cdot p_{\text{recovery}}(x) = p(x) - q(x)$.

Total: $q(x) + (p(x) - q(x)) = p(x)$. ✓
Both cases yield $P(\text{output} = x) = p(x)$. The output distribution is exactly the target distribution, regardless of how good or bad the draft model is. A bad draft model just means more rejections (slower speed), not different output.
Expected Speedup: How Many Tokens Per Step?
The expected number of accepted tokens depends on the acceptance rate. Define $\alpha_i$ as the acceptance probability of the $i$-th draft token (which depends on how aligned $p$ and $q$ are at that position).
For simplicity, assume a constant acceptance rate $\alpha$ across positions. The probability that the first $i$ tokens are all accepted is $\alpha^i$. The expected number of accepted tokens per speculation step is:

$$E[\text{accepted}] = \sum_{i=1}^{K} \alpha^i$$

Plus the bonus token (either the $(K{+}1)$-th sampled from the target when all $K$ are accepted, or the recovery token when one is rejected):

$$E[\text{tokens per step}] = 1 + \sum_{i=1}^{K} \alpha^i = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$
Let's compute this for various acceptance rates with $K = 4$:
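A quick evaluation of the formula (a throwaway sketch, values rounded in the comments):

```python
# E[tokens per step] = (1 - alpha^(K+1)) / (1 - alpha), with K = 4
K = 4
for alpha in (0.5, 0.6, 0.7, 0.8, 0.9):
    expected = (1 - alpha ** (K + 1)) / (1 - alpha)
    print(f"alpha = {alpha:.1f} -> {expected:.2f} tokens per step")
# 0.5 -> 1.94, 0.6 -> 2.31, 0.7 -> 2.77, 0.8 -> 3.36, 0.9 -> 4.10
```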
At an 80% acceptance rate, we get about 3.4 tokens per verification step (and over 4 at 90%). If the draft model is 20x faster than the target, the speedup is significant.
The overall speedup formula:

$$\text{speedup} = \frac{E[\text{tokens per step}] \cdot T_{\text{target}}}{K \cdot T_{\text{draft}} + T_{\text{target}}}$$

where $T_{\text{target}}$ is the target model's forward-pass latency and $T_{\text{draft}}$ is the draft model's per-token latency.
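Plugging in the latencies from the example earlier in the post (40ms target, 2ms draft) with $\alpha = 0.8$ and $K = 4$ -- a rough sketch, not a benchmark:

```python
T_TARGET, T_DRAFT = 40e-3, 2e-3  # seconds, from the example above
alpha, K = 0.8, 4

tokens_per_step = (1 - alpha ** (K + 1)) / (1 - alpha)           # ~3.36
speedup = tokens_per_step * T_TARGET / (K * T_DRAFT + T_TARGET)
print(f"{speedup:.2f}x")  # ~2.8x
```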
Choosing the Draft Model
The choice of draft model is critical. You need a model that is simultaneously:
- Fast: The whole point is that draft steps are cheaper than target steps
- Accurate: Higher acceptance rate means more tokens per verification step
- Compatible: Must use the same tokenizer and vocabulary as the target model
These requirements create a fundamental tension. A more capable draft model has a higher acceptance rate but runs slower. A tiny draft model is lightning fast but might get rejected constantly.
Same Tokenizer Requirement
This is a hard constraint. If the draft model uses a different tokenizer, the tokens don't correspond, and verification is impossible. In practice, this usually means the draft model must be from the same model family.
Common pairings:
| Target Model | Draft Model | Speed Ratio | Typical $\alpha$ |
|---|---|---|---|
| Llama 2 70B | Llama 2 7B | ~10x | 0.7-0.85 |
| GPT-4 | GPT-3.5 | ~5-8x | 0.6-0.8 |
| Codex 175B | Codex 12B | ~12x | 0.8-0.9 (code) |
| Gemma 2 27B | Gemma 2 2B | ~8x | 0.65-0.8 |
Note that code generation tends to have higher acceptance rates -- code is more predictable than natural language, with boilerplate, common patterns, and strict syntax.
The Speed-Accuracy Tradeoff
Key insight: with a bad draft model, the optimal $K$ is small. You should only speculate a few tokens ahead because most will be rejected anyway. With a strong draft model, you can speculate further and reap bigger rewards.
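A small sweep over $K$ (same formulas as above, with the post's example latencies) makes the tradeoff concrete:

```python
T_TARGET, T_DRAFT = 40e-3, 2e-3

def speedup(alpha: float, K: int) -> float:
    tokens = (1 - alpha ** (K + 1)) / (1 - alpha)
    return tokens * T_TARGET / (K * T_DRAFT + T_TARGET)

for alpha in (0.4, 0.8):
    best_K = max(range(1, 17), key=lambda K: speedup(alpha, K))
    print(f"alpha={alpha}: best K={best_K}, speedup={speedup(alpha, best_K):.2f}x")
# alpha=0.4: best K=2 (~1.4x); alpha=0.8: best K=8 (~3.1x)
```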
Beyond Small Models: Alternative Drafters
The draft model doesn't have to be a smaller neural network. Several alternatives exist:
N-gram models: A simple lookup table that predicts the next token based on the previous $n - 1$ tokens. Essentially free to evaluate. Works surprisingly well for repetitive text, code, and common phrases. The acceptance rate is lower, but the near-zero cost compensates (see the toy drafter after this list).
Retrieval-based drafters: Look up similar contexts in a database and predict the most likely continuation. This is the idea behind REST (He et al., 2023).
Quantized versions of the target: Aggressively quantize the target model (e.g., from fp16 to int4) and use the quantized version as the drafter. Same architecture, same tokenizer, much faster, but somewhat less accurate.
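To make the n-gram idea concrete, here's a toy lookup-table drafter. It's a hypothetical sketch -- the class name and API are mine, not from any particular system:

```python
from collections import Counter, defaultdict

class NgramDrafter:
    """Toy n-gram drafter: maps the last n-1 tokens to the most frequent next token."""

    def __init__(self, n: int = 3):
        self.n = n
        self.counts = defaultdict(Counter)

    def observe(self, tokens: list[int]) -> None:
        """Build the lookup table from token streams seen so far."""
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i : i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def draft(self, context: list[int], k: int) -> list[int]:
        """Propose up to k tokens; stop early if the context was never seen."""
        out, ctx = [], list(context)
        for _ in range(k):
            key = tuple(ctx[-(self.n - 1):])
            if key not in self.counts:
                break
            tok = self.counts[key].most_common(1)[0][0]
            out.append(tok)
            ctx.append(tok)
        return out
```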
Medusa: No Separate Draft Model Needed
What if we could eliminate the draft model entirely? Medusa (Cai et al., 2024) does exactly this by adding multiple lightweight "prediction heads" directly to the target model.
The Idea
Standard LLMs have a single prediction head: the final linear layer that maps the last hidden state to vocabulary logits. This head predicts the next token.
Medusa adds $K$ additional heads, each trained to predict tokens further into the future:
- Head 1: predicts token $t+1$ (same as the original head)
- Head 2: predicts token $t+2$ given the hidden state at position $t$
- Head 3: predicts token $t+3$ given the hidden state at position $t$
- ...and so on
Each head is a small MLP (typically 1-2 layers) that operates on the same hidden state. The overhead of running extra heads is minimal compared to the rest of the model.
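Here's a sketch of what such a head might look like; the residual-block design and the shapes are assumptions for illustration, not the official Medusa code:

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra head: a residual MLP on the final hidden state, then a vocab projection."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.lm_head(h + self.act(self.proj(h)))

# All heads read the SAME hidden state at the current position.
hidden_size, vocab_size, num_heads = 4096, 32000, 3
heads = nn.ModuleList(MedusaHead(hidden_size, vocab_size) for _ in range(num_heads))
h_t = torch.randn(1, hidden_size)        # last hidden state from the frozen base model
logits = [head(h_t) for head in heads]   # extra head k predicts token t + 1 + k
```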
Tree Attention: Exploring Multiple Candidates
Here's where it gets clever. Each Medusa head provides a probability distribution over the vocabulary. Rather than greedily picking one token per head, Medusa takes the top-$k$ candidates from each head and forms a tree of possible continuations.
For example, with 3 heads and top-2 candidates each:
- Head 1 suggests: `{"quick": 0.45, "lazy": 0.30}`
- Head 2 suggests: `{"brown": 0.60, "red": 0.25}`
- Head 3 suggests: `{"fox": 0.72, "dog": 0.18}`
This creates a tree with up to $2 \times 2 \times 2 = 8$ candidate paths. Using a specially constructed tree attention mask, the target model can verify all paths simultaneously in a single forward pass.
The tree attention mask is the key innovation. In standard causal attention, token $i$ can attend to tokens $1, \dots, i$. In tree attention, each node in the tree can attend to its ancestors (its path from root to self) but not to nodes on other branches. This prevents information leaking between candidate paths.
The beauty of tree attention is that it transforms a sequential search (try path 1, then path 2, then path 3...) into a parallel verification. The model loads its weights once, processes all candidate paths, and determines which tokens to accept -- all in one pass.
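Here's one way to build such a mask from parent pointers; the tree encoding is a hypothetical choice for illustration (real systems precompute these masks for fixed tree shapes):

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """parents[i] is the index of node i's parent, or -1 for the root.
    Node i may attend to node j iff j lies on i's path to the root."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:       # walk from node i up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Two candidate paths sharing a root: 0 -> 1 -> 3 and 0 -> 2 -> 4
print(tree_attention_mask([-1, 0, 0, 1, 2]).int())
# Row 3 attends to {0, 1, 3}; row 4 attends to {0, 2, 4}; branches never mix.
```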
Medusa Training
The Medusa heads are trained on the same data as the base model, but with an offset in the target labels: the $k$-th extra head at position $t$ is trained to predict the ground-truth token at position $t + 1 + k$.
Crucially, the base model weights are frozen during Medusa training. Only the lightweight heads are trained, typically requiring a few hours on a single GPU. This makes Medusa much cheaper to adopt than training a separate draft model from scratch.
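As a hedged sketch of that objective, with the label offset made explicit (the per-head weights `lambdas` and the helper name are assumptions, not the official training code):

```python
import torch.nn.functional as F

def medusa_loss(hidden, heads, tokens, lambdas):
    """hidden: [B, T, H] from the frozen base model; tokens: [B, T] ground truth.
    Extra head k (1-indexed) at position t is trained on the token at t + 1 + k."""
    loss = 0.0
    for k, (head, lam) in enumerate(zip(heads, lambdas), start=1):
        logits = head(hidden[:, : -(k + 1)])   # positions that have a valid target
        targets = tokens[:, k + 1 :]           # labels offset by k + 1
        loss = loss + lam * F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    return loss
```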
Self-Speculative Decoding: The Model Drafts for Itself
Medusa requires training additional heads. Self-speculative decoding (Zhang et al., 2023) takes a different approach: use the target model itself as the drafter by performing early exit.
The idea is simple. A 70B model has, say, 80 transformer layers. The representations at layer 20 already encode a lot of information about the next token. If we attach a lightweight prediction head to layer 20, we get a "cheap" draft model that:
- Uses the same tokenizer (trivially)
- Shares most of the computation with the target model
- Requires no separate model in memory
The tradeoff is that the draft quality depends on which layer we exit at. Exit too early and the acceptance rate plummets. Exit too late and we save little computation. Research suggests that exiting around 25-40% of the way through the model provides a good balance.
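A sketch of what an early-exit drafter might look like, assuming a generic decoder whose layers can be called as `layer(h)`; real attention layers also need masks and caches, so this is illustrative only:

```python
import torch
import torch.nn as nn

class EarlyExitDrafter(nn.Module):
    """Draft by running only the first `exit_layer` layers of the target model."""

    def __init__(self, embed: nn.Module, target_layers: nn.ModuleList,
                 exit_layer: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = embed
        self.layers = target_layers[:exit_layer]             # shared with the target
        self.exit_head = nn.Linear(hidden_size, vocab_size)  # small trained head

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(ids)
        for layer in self.layers:   # e.g. ~25-40% of the full stack
            h = layer(h)
        return self.exit_head(h)    # draft logits
```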
Practical Speedups: When Does It Actually Help?
Speculative decoding sounds great in theory. In practice, the speedup depends heavily on the workload.
Best Case: Predictable Text
When the text is highly predictable -- code, boilerplate, common phrases, formatted data -- the draft model matches the target model well, acceptance rates are high, and speedups of 2-3x are common.
Worst Case: Creative/Diverse Text
When the text is creative, unusual, or requires reasoning, the draft model's predictions diverge from the target, acceptance rates drop, and the overhead of running the draft model eats into the speedup.
Batch Size Matters
This is the most important practical consideration. Speculative decoding shines at batch size 1 (single user, single request), where decode is heavily memory-bound. As batch size increases:
- The target model's forward pass becomes more compute-bound (processing many sequences)
- The cost of verification increases proportionally to the number of candidate tokens
- The draft model's overhead is no longer negligible
In production serving systems with continuous batching (like vLLM), the effective batch size is often large enough that speculative decoding provides diminished returns. The technique is most impactful for:
- Interactive single-user applications (chatbots, coding assistants)
- Latency-sensitive applications where throughput matters less
- Edge/mobile deployment where batch size is always 1
Temperature and Sampling
Speculative decoding works with any sampling strategy -- greedy, temperature sampling, top-$k$, top-$p$ (nucleus). The key is that both the draft and target models must use the same sampling parameters.
At temperature 0 (greedy decoding), the acceptance rule simplifies: a draft token is accepted if and only if it matches the target's argmax. There's no probabilistic acceptance -- it's binary.
At higher temperatures, both distributions become flatter, which actually helps alignment. The acceptance rate tends to increase with temperature because both models assign more uniform probabilities, making their distributions more similar.
Full Implementation
Here's a minimal, self-contained PyTorch sketch of speculative decoding -- a simplified reference implementation, not production code. It omits KV caching and batching, and assumes `target` and `draft` are callables mapping token ids `[1, seq_len]` to logits `[1, seq_len, vocab_size]`:
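```python
import torch


def sample(probs: torch.Tensor) -> torch.Tensor:
    """Draw one token id from a [vocab_size] probability vector."""
    return torch.multinomial(probs, num_samples=1)


@torch.no_grad()
def speculative_decode(target, draft, input_ids, max_new_tokens=64, K=4, temperature=1.0):
    """Generate from `target`, using `draft` to propose K tokens per step.

    Output distribution is exactly the target's. Assumes temperature > 0
    (for the greedy special case, see the argmax rule discussed above).
    """
    ids = input_ids  # [1, seq_len]
    n_generated = 0
    while n_generated < max_new_tokens:
        prefix_len = ids.shape[1]

        # --- Phase 1: draft K candidate tokens autoregressively (cheap) ---
        draft_ids = ids
        draft_probs = []  # q at each speculated position
        for _ in range(K):
            logits = draft(draft_ids)[0, -1] / temperature
            q = torch.softmax(logits, dim=-1)
            tok = sample(q)
            draft_probs.append(q)
            draft_ids = torch.cat([draft_ids, tok.view(1, 1)], dim=1)

        # --- Phase 2: verify all K candidates in ONE target forward pass ---
        logits = target(draft_ids) / temperature  # [1, prefix_len + K, vocab]
        # The distribution at position j predicts token j + 1, so these K + 1
        # rows cover every drafted token plus one "bonus" position at the end.
        p_all = torch.softmax(logits[0, prefix_len - 1 : prefix_len + K], dim=-1)

        n_accepted, rejected = 0, False
        for i in range(K):
            tok = draft_ids[0, prefix_len + i]
            p, q = p_all[i], draft_probs[i]
            if torch.rand(()) < torch.clamp(p[tok] / q[tok], max=1.0):
                n_accepted += 1  # accept with probability min(1, p/q)
            else:
                rejected = True
                break

        if rejected:
            # Keep accepted tokens; sample the correction from the
            # normalized residual max(0, p - q) at the rejection point.
            residual = torch.clamp(p_all[n_accepted] - draft_probs[n_accepted], min=0.0)
            correction = sample(residual / residual.sum())
            ids = torch.cat(
                [draft_ids[:, : prefix_len + n_accepted], correction.view(1, 1)], dim=1
            )
            n_generated += n_accepted + 1
        else:
            # All K accepted: keep them and sample a bonus token "for free".
            bonus = sample(p_all[K])
            ids = torch.cat([draft_ids, bonus.view(1, 1)], dim=1)
            n_generated += K + 1
    return ids
```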
KV Cache Management
The trickiest part of implementation is managing the KV caches for both models. When tokens are rejected:
- Target model cache: Must be truncated to remove entries for rejected tokens. The simplified sketch above sidesteps this by recomputing the full prefix every step; in practice you'd slice the cache tensors back to the last accepted position.
- Draft model cache: Must be completely reset or rebuilt, because the draft model's internal state diverged from the accepted sequence at the point of rejection.
Efficient cache management is what makes the difference between a textbook implementation and a production-ready one. Systems like vLLM and TensorRT-LLM have sophisticated cache management specifically optimized for speculative decoding.
Advanced Variants
The field has evolved rapidly since the original speculative decoding papers. Here is a survey of the major variants.
Staged Speculative Decoding
Instead of one draft model, use a cascade: an extremely fast "level-0" drafter (n-gram model), a medium-speed "level-1" drafter (small neural network), and the target model for verification. Each stage filters candidates, reducing the work the target model needs to do.
Speculative Decoding with Tree Drafts
Instead of generating a single chain of tokens, the draft model generates a tree of candidates (similar to Medusa but using a separate draft model). The target model verifies all paths using tree attention. This increases the probability that at least one long path is fully accepted.
Online Speculative Decoding (Liu et al., 2024)
The draft model is continuously fine-tuned during serving to better match the target model's behavior on the current distribution of queries. As the draft model improves, acceptance rates increase over time.
SpecInfer (Miao et al., 2024)
Combines multiple draft models (of different architectures) into an ensemble, boosting the diversity of candidate tokens. The target model's single verification pass checks candidates from all draft models.
Lookahead Decoding (Fu et al., 2024)
Eliminates the draft model entirely by exploiting Jacobi iteration. The idea: guess all tokens simultaneously, run a forward pass, and check which positions converged. Repeat until all positions agree. Provably generates from the correct distribution.
Summary
Speculative decoding is a rare case of getting something for nothing. The output is mathematically identical to standard decoding -- same distribution, same quality, same everything -- but faster.
The key ideas:
- Single-token decode is memory-bound. The GPU loads all model weights for tiny matrix-vector multiplies. Most compute capacity is wasted.
- Verification is almost free. Processing $K$ tokens costs roughly the same as processing 1 token, because the bottleneck is weight loading, not arithmetic.
- Draft-then-verify exploits the asymmetry. A cheap draft model proposes candidates; the expensive target model verifies in parallel.
- Modified rejection sampling guarantees losslessness. The accept/reject/recovery mechanism produces samples from the exact target distribution.
- The speedup scales with prediction quality. Predictable text (code, boilerplate) sees 2-3x speedup. Creative text sees 1.2-1.5x. Batch size 1 benefits most.
The technique has rapidly moved from research to production. As of early 2026, speculative decoding (or its variants like Medusa) is deployed in virtually every major LLM serving system: vLLM, TensorRT-LLM, HuggingFace TGI, and proprietary systems at OpenAI, Google, and Anthropic.
The autoregressive bottleneck may be fundamental to how language models work, but speculative decoding proves that the cost of that bottleneck is not. We can decode sequentially while still keeping our GPUs busy.
