Speculative Decoding ↗
noSummary: Speed up generation with draft models and n-gram speculation
Original Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
Speed up generation with draft models and n-gram speculation
Speed up text generation by using a smaller “draft” model to assist the main model, or using n-gram based speculation.
Speculative decoding may slow down output generation if the draft model is not a good speculator, or if token count/speculation length is too high or too low. It may also reduce max throughput. Test different models and speculation lengths for your use case.
Configuration options#
| Flag | Type | Description |
|---|---|---|
--draft-model | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. |
--draft-token-count | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. |
--ngram-speculation-length | int32 | Alternative to draft model: uses N-gram based speculation from previous input. |
--draft-model and --ngram-speculation-length cannot be used together.
Recommended draft models#
| Draft model | Use with |
|---|---|
accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
Examples#
Use a smaller model to speed up generation:
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
--draft-token-count=4
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="N-gram speculation"></span>
Use input history for speculation (no draft model needed):
```bash
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--ngram-speculation-length=3 \
--draft-token-count=4
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
<span class="callout-start" data-callout-type="tip"></span>
Fireworks also supports [Predicted Outputs](/guides/predicted-outputs) which works in addition to model-based speculative decoding.
<span class="callout-end"></span>Link last verified
June 7, 2026.
View original ↗
Source: Fireworks AI Docs
Link last verified: 2026-06-07