Speculative Decoding

Summary: Speed up generation with draft models and n-gram speculation



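Speed up text generation by using a smaller “draft” model to assist the main model, or by using n-gram based speculation. In both cases, several tokens are proposed ahead of time and the main model verifies them, so accepted tokens cost less than generating each one separately.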

Speculative decoding may slow down output generation if the draft model is not a good speculator, or if the draft token count or speculation length is set too high or too low. It may also reduce maximum throughput. Test different models and speculation lengths for your use case.
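
As a rough way to reason about this trade-off, the speculative decoding literature commonly models each drafted token as being accepted independently with probability $\alpha$. That model is a simplifying assumption and is not stated in this documentation; treating $\gamma$ as the number of drafted tokens per step (the draft token count) is likewise an assumption. Under it, the expected number of tokens produced per verification step of the main model is:

$$
\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
$$

For example, with $\gamma = 4$, an acceptance rate of $\alpha = 0.8$ gives about 3.36 tokens per step, while $\alpha = 0.3$ gives about 1.43. A weak speculator therefore adds drafting cost for little gain, which is consistent with the advice to measure several models and lengths.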

Configuration options

| Flag | Type | Description |
| --- | --- | --- |
| `--draft-model` | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. |
| `--draft-token-count` | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. |
| `--ngram-speculation-length` | int32 | Alternative to draft model: uses N-gram based speculation from previous input. |

`--draft-model` and `--ngram-speculation-length` cannot be used together.

| Draft model | Use with |
| --- | --- |
| `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B |
| `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B |
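
The flag rules in this section can be checked locally before a deployment command is run. The following is a minimal sketch, assuming a hypothetical helper named `check_spec_flags` that you define yourself; it is not part of `firectl` and only encodes the two rules stated above (the two speculation methods are mutually exclusive, and a draft token count is required with either one).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of firectl): validates the flag rules
# described in this section. Pass an empty string for an unused option.
# Usage: check_spec_flags "<draft model>" "<ngram length>" "<draft token count>"
check_spec_flags() {
  local draft_model="$1" ngram_len="$2" token_count="$3"

  # --draft-model and --ngram-speculation-length cannot be used together.
  if [ -n "$draft_model" ] && [ -n "$ngram_len" ]; then
    echo "error: choose either a draft model or n-gram speculation, not both" >&2
    return 1
  fi

  # --draft-token-count is required with either speculation method.
  if { [ -n "$draft_model" ] || [ -n "$ngram_len" ]; } && [ -z "$token_count" ]; then
    echo "error: --draft-token-count is required" >&2
    return 1
  fi

  return 0
}
```

For instance, `check_spec_flags "" "3" "4"` succeeds, while passing both a draft model and an n-gram length returns a non-zero status.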

Examples

Use a smaller model to speed up generation:

```bash
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
  --draft-token-count=4
```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="N-gram speculation"></span>
Use input history for speculation (no draft model needed):

```bash
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --ngram-speculation-length=3 \
  --draft-token-count=4
```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>
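
Because the right draft token count depends on the workload, it can help to compare a few values side by side. The sketch below is a dry run only: it prints one deployment command per candidate value instead of executing it. The function name `print_sweep` and the candidate values 2, 4, and 8 are illustrative assumptions, not recommendations from this documentation, and each printed command would create a separate deployment if you chose to run it.

```bash
#!/usr/bin/env bash
# Dry run: print one deployment command per candidate draft token count.
# print_sweep is a hypothetical helper, not part of firectl.
# Usage: print_sweep <model> <draft model> <count> [<count> ...]
print_sweep() {
  local model="$1" draft_model="$2"
  shift 2
  local count
  for count in "$@"; do
    printf 'firectl deployment create %s --draft-model="%s" --draft-token-count=%s\n' \
      "$model" "$draft_model" "$count"
  done
}

# Candidate values below are assumptions chosen for illustration.
print_sweep \
  accounts/fireworks/models/llama-v3p3-70b-instruct \
  accounts/fireworks/models/llama-v3p2-1b-instruct \
  2 4 8
```

Review the printed commands, run the ones you want, and compare generation speed on representative prompts before settling on a value.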

<span class="callout-start" data-callout-type="tip"></span>
  Fireworks also supports [Predicted Outputs](/guides/predicted-outputs), which can be used in addition to model-based speculative decoding.
<span class="callout-end"></span>
Link last verified June 7, 2026.
Source: Fireworks AI Docs