Quantization

Summary: Reduce model precision to improve performance and lower costs

Quantization reduces the number of bits used to serve a model, which improves performance and reduces cost by 30-50%. However, it also changes the model's numerics, which may introduce small changes to the output.
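To make "small changes to the output" concrete, the sketch below round-trips a few values through a simple symmetric 8-bit integer quantizer. It is an illustration only: it is not the FP8 scheme used for serving, and the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


# Made-up values standing in for a handful of model weights.
weights = [0.8123, -0.4071, 0.0312, -1.2700, 0.5555]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

for original, approx in zip(weights, restored):
    error = abs(original - approx)
    print(f"{original:+.4f} -> {approx:+.4f} (error {error:.4f})")
```

Each restored value differs from the original by at most half the scale. Errors of this kind are what accumulate into slightly different model outputs.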

Read our blog post for a detailed treatment of how quantization affects model quality.

## Checking available precisions

Models may support different numerical precisions like FP16, FP8, BF16, or INT8, which affect memory usage and inference speed.
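As a rough illustration of the memory effect, the sketch below estimates weight storage at each precision from the number of bytes per parameter. It assumes a model with roughly 8 billion parameters and counts weights only. KV cache, activations, and runtime overhead are ignored, and it says nothing about actual inference speed.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}


def weight_memory_gb(num_params, precision):
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


num_params = 8_000_000_000  # assumption: roughly an 8B-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why 8-bit precisions can fit the same model into less accelerator memory.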

Check default precision:

firectl model get accounts/fireworks/models/llama-v3p1-8b-instruct | grep "Default Precision"

Check supported precisions:

firectl model get accounts/fireworks/models/llama-v3p1-8b-instruct | grep -E "(Supported Precisions|Supported Precisions With Calibration)"

The `Precisions` field lists the precisions the model has been prepared for.
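If you prefer to run the same check from a script, the following sketch wraps `firectl model get` with Python's `subprocess` module and filters the output the way the `grep` commands do. It assumes `firectl` is installed and authenticated, and that each field is printed on its own line. The helper names are hypothetical.

```python
import subprocess

# Labels taken from the grep patterns above.
PRECISION_LABELS = ("Default Precision", "Supported Precisions")


def filter_precision_lines(output):
    """Keep only the lines of command output that mention a precision label."""
    return [
        line.strip()
        for line in output.splitlines()
        if any(label in line for label in PRECISION_LABELS)
    ]


def get_precision_lines(model_name):
    """Run `firectl model get` and return its precision-related lines."""
    result = subprocess.run(
        ["firectl", "model", "get", model_name],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if firectl exits with an error
    )
    return filter_precision_lines(result.stdout)
```

Calling `get_precision_lines("accounts/fireworks/models/llama-v3p1-8b-instruct")` would then return the same lines the two `grep` commands print.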

## Quantizing a model

A model can be quantized to 8-bit floating-point (FP8) precision.

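<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
```bash
    firectl prepare-model <MODEL_ID>
    ```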
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Python (REST API)"></span>
```python
    import os
    import requests

    ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
    API_KEY = os.environ.get("FIREWORKS_API_KEY")
    MODEL_ID = "<YOUR_MODEL_ID>" # The ID of the model you want to prepare

    response = requests.post(
      f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}:prepare",
      headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
      },
      json={
        "precision": "FP8"
      }
    )

    print(response.json())
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="note"></span>This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.<span class="callout-end"></span>

You can check on the status of preparation by running:

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
```bash
    firectl model get <MODEL_ID>
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Python (REST API)"></span>
```python
    import os
    import requests

    ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
    API_KEY = os.environ.get("FIREWORKS_API_KEY")
    MODEL_ID = "<YOUR_MODEL_ID>" # The ID of the model you want to get

    response = requests.get(
      f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
      headers={
        "Authorization": f"Bearer {API_KEY}"
      }
    )

    print(response.json())
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

Preparation is still in progress while the model's state is `PREPARING`. A successfully prepared model will have the desired precision added to the `Precisions` list.
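Rather than re-running the status check by hand, you could poll until preparation finishes. The sketch below is one way to do that, not an official client. It assumes the model's state is exposed under a top-level `state` key in the response JSON, and `wait_until_prepared` is a hypothetical helper name.

```python
import time


def wait_until_prepared(fetch_model, poll_seconds=30, timeout_seconds=3600):
    """Poll until the model leaves the PREPARING state, then return it.

    fetch_model is any zero-argument callable that returns the model as a
    dict, for example a wrapper around the GET request shown above.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        model = fetch_model()
        # Assumption: the state is stored under a top-level "state" key.
        if model.get("state") != "PREPARING":
            return model
        if time.monotonic() >= deadline:
            raise TimeoutError("model is still PREPARING after the timeout")
        time.sleep(poll_seconds)
```

Passing the fetch step in as a callable keeps the loop independent of how the request is made, so the same helper works with `requests`, another HTTP client, or a stub in tests. Once it returns, inspect the returned model to confirm that the desired precision was added.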

## Creating an FP8 deployment

By default, creating a deployment uses the FP16 checkpoint. To use a quantized FP8 checkpoint, first ensure the model has been prepared for FP8 (see [Checking available precisions](#checking-available-precisions) above), then pass the `--precision` flag when creating your deployment:

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
```bash
    firectl deployment create <MODEL> --accelerator-type NVIDIA_H100_80GB --precision FP8
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Python (REST API)"></span>
```python
    import os
    import requests

    ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
    API_KEY = os.environ.get("FIREWORKS_API_KEY")
    # The ID of the model you want to deploy.
    # The model must be prepared for FP8 precision.
    MODEL_ID = "<YOUR_MODEL_ID>"
    DEPLOYMENT_NAME = "My FP8 Deployment"

    response = requests.post(
      f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/deployments",
      headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
      },
      json={
        "displayName": DEPLOYMENT_NAME,
        "baseModel": MODEL_ID,
        "acceleratorType": "NVIDIA_H100_80GB",
        "precision": "FP8",
      }
    )

    print(response.json())
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="note"></span>Quantized deployments can only be served using H100 GPUs.<span class="callout-end"></span>
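Because of this restriction, it can help to validate a request before sending it. The sketch below builds the JSON body used in the REST example above and rejects an FP8 request on any other accelerator. The helper name is hypothetical, and it treats `NVIDIA_H100_80GB` as the only H100 identifier because that is the only one shown on this page.

```python
QUANTIZED_PRECISIONS = {"FP8"}  # the quantized precision covered on this page
H100_ACCELERATOR = "NVIDIA_H100_80GB"  # accelerator type used in the examples above


def build_deployment_payload(model_id, display_name, accelerator_type, precision=None):
    """Build the deployment request body, rejecting quantized non-H100 requests."""
    if precision in QUANTIZED_PRECISIONS and accelerator_type != H100_ACCELERATOR:
        raise ValueError(
            f"{precision} deployments require {H100_ACCELERATOR}, got {accelerator_type}"
        )
    payload = {
        "displayName": display_name,
        "baseModel": model_id,
        "acceleratorType": accelerator_type,
    }
    # Leave the field out entirely to fall back to the default precision.
    if precision is not None:
        payload["precision"] = precision
    return payload
```

The returned dict can be passed as the `json=` argument of the `requests.post` call in the deployment example.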
Source: Fireworks AI Docs
Link last verified: 2026-06-07