Speculative Decoding

Summary: Speed up generation with draft models and n-gram speculation

Speed up text generation by using a smaller “draft” model to assist the main model, or by using n-gram-based speculation.

Speculative decoding may slow down output generation if the draft model is a poor speculator, or if the draft token count or speculation length is too high or too low. It may also reduce maximum throughput. Test different draft models and speculation lengths for your use case.
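One way to organize such a test is to generate one candidate deployment command per draft token count and review them before running anything. The sketch below is a dry run that only prints commands. The values 2, 4, and 8 are arbitrary illustrations, and `print_candidate` is a hypothetical helper, not part of `firectl`.

```bash
#!/usr/bin/env bash
# Dry run: print one candidate deployment command per draft token count.
# Nothing is executed here; review the output before running any of it.

BASE_MODEL="accounts/fireworks/models/llama-v3p3-70b-instruct"
DRAFT_MODEL="accounts/fireworks/models/llama-v3p2-1b-instruct"

# Hypothetical helper: prints the firectl command for one token count.
print_candidate() {
  local count="$1"
  printf 'firectl deployment create %s --draft-model="%s" --draft-token-count=%s\n' \
    "$BASE_MODEL" "$DRAFT_MODEL" "$count"
}

# Illustrative values only; pick the range that suits your workload.
for count in 2 4 8; do
  print_candidate "$count"
done
```

Each printed command would create a separate deployment, so latency and throughput need to be measured per deployment with your own traffic.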

Configuration options#

| Flag | Type | Description |
| --- | --- | --- |
| `--draft-model` | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. |
| `--draft-token-count` | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. |
| `--ngram-speculation-length` | int32 | Alternative to draft model: uses N-gram based speculation from previous input. |

`--draft-model` and `--ngram-speculation-length` cannot be used together.
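The two constraints above (the flags are mutually exclusive, and `--draft-token-count` is required with either one) can be checked locally before a deployment is created. The function below is a minimal sketch of such a pre-flight check. `check_speculation_flags` is a hypothetical helper, not a `firectl` feature, and it only inspects flag names, not their values.

```bash
# Hypothetical pre-flight check, not part of firectl.
# Returns 0 if the speculation flag combination is valid, 1 otherwise.
check_speculation_flags() {
  local has_draft=0 has_ngram=0 has_count=0 arg
  for arg in "$@"; do
    case "$arg" in
      --draft-model|--draft-model=*) has_draft=1 ;;
      --ngram-speculation-length|--ngram-speculation-length=*) has_ngram=1 ;;
      --draft-token-count|--draft-token-count=*) has_count=1 ;;
    esac
  done
  # The two speculation modes cannot be combined.
  if [ "$has_draft" -eq 1 ] && [ "$has_ngram" -eq 1 ]; then
    echo "error: --draft-model and --ngram-speculation-length cannot be used together" >&2
    return 1
  fi
  # Either mode needs an explicit draft token count.
  if [ $((has_draft + has_ngram)) -gt 0 ] && [ "$has_count" -eq 0 ]; then
    echo "error: --draft-token-count is required with a draft model or n-gram speculation" >&2
    return 1
  fi
  return 0
}
```

In a wrapper script this could be used as `check_speculation_flags "$@" && firectl deployment create "$@"`, so an invalid combination fails fast on the client side.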

Recommended draft models#

| Draft model | Use with |
| --- | --- |
| `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B |
| `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B |
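For scripted deployments, the recommendations above could be encoded as a small lookup. This is a sketch under the assumption that the base model name contains `llama` or `qwen` in lowercase. `pick_draft_model` is a hypothetical helper, and it does not verify that the base model is larger than 3B.

```bash
# Hypothetical helper: map a base model name to the recommended draft model
# by family. It matches on the name only and does NOT check model size;
# confirming the base model is larger than 3B is left to the caller.
pick_draft_model() {
  case "$1" in
    *llama*) echo "accounts/fireworks/models/llama-v3p2-1b-instruct" ;;
    *qwen*)  echo "accounts/fireworks/models/qwen2p5-0p5b-instruct" ;;
    *)       return 1 ;;  # no recommendation for this family
  esac
}
```

A caller might write `draft="$(pick_draft_model "$base_model")" || echo "no recommended draft model" >&2` and fall back to n-gram speculation or no speculation when the function fails.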

Examples#

Use a smaller model to speed up generation:

```bash
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
  --draft-token-count=4
```

Use input history for speculation (no draft model needed):

```bash
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --ngram-speculation-length=3 \
  --draft-token-count=4
```
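This page does not describe how the n-gram lookup works internally. As intuition only, the toy sketch below shows one common approach (sometimes called prompt lookup): find an earlier occurrence of the last `n` tokens and propose the tokens that followed it. The function `propose_ngram` and its parameters `n` and `k` are hypothetical, and treating them as analogous to the speculation length and draft token count is an assumption, not a statement about the Fireworks implementation.

```bash
#!/usr/bin/env bash
# Toy n-gram lookup (illustrative only, requires bash for arrays).
# Usage: propose_ngram N K token1 token2 ...
# Prints up to K tokens that followed the most recent earlier occurrence
# of the final N tokens, or nothing if there is no earlier occurrence.
propose_ngram() {
  local n="$1" k="$2"; shift 2
  local tokens=("$@")
  local len=${#tokens[@]}
  local start i j match
  # Need more than n tokens for an earlier copy of the n-gram to exist.
  [ "$len" -gt "$n" ] || return 0
  # Scan backwards from the most recent candidate position.
  for (( start = len - n - 1; start >= 0; start-- )); do
    match=1
    for (( j = 0; j < n; j++ )); do
      if [ "${tokens[start + j]}" != "${tokens[len - n + j]}" ]; then
        match=0
        break
      fi
    done
    if [ "$match" -eq 1 ]; then
      # Collect the tokens that followed the earlier occurrence.
      local out=()
      for (( i = start + n; i < len && i < start + n + k; i++ )); do
        out+=("${tokens[i]}")
      done
      echo "${out[*]}"
      return 0
    fi
  done
}

propose_ngram 2 3 a b c d a b
```

The proposed tokens are only guesses. In speculative decoding the main model still verifies them, which is why repetitive inputs (code edits, quoted text) tend to benefit most from this style of speculation.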

Tip: Fireworks also supports [Predicted Outputs](/guides/predicted-outputs), which works in addition to model-based speculative decoding.

Link last verified June 7, 2026.

Source: Fireworks AI Docs
