Deployments

Summary: Configure and manage on-demand deployments on dedicated GPUs

New to deployments? Start with our Deployments Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.

On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:

  • Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
  • No hard rate limits – Only limited by your deployment’s capacity
  • Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
  • Broader model selection – Access models not available on serverless
  • Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
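
To make the billing difference concrete, here is a rough break-even sketch comparing per-token and GPU-second billing. All prices below are hypothetical placeholders, not Fireworks pricing, and the helper names are illustrative only. It assumes the deployment runs continuously and that every token is billed at one flat rate.

```python
def serverless_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost when billed per token."""
    return tokens / 1_000_000 * price_per_million_tokens


def on_demand_cost(gpu_count: int, seconds: float, price_per_gpu_hour: float) -> float:
    """Cost when billed per GPU-second, independent of tokens served."""
    return gpu_count * seconds * price_per_gpu_hour / 3600


def break_even_tokens_per_second(
    gpu_count: int, price_per_gpu_hour: float, price_per_million_tokens: float
) -> float:
    """Sustained throughput above which dedicated GPUs cost less than per-token billing."""
    gpu_cost_per_second = gpu_count * price_per_gpu_hour / 3600
    cost_per_token = price_per_million_tokens / 1_000_000
    return gpu_cost_per_second / cost_per_token


# Hypothetical inputs: one GPU at $3.60/hour versus $0.90 per million tokens.
threshold = break_even_tokens_per_second(
    gpu_count=1, price_per_gpu_hour=3.60, price_per_million_tokens=0.90
)
print(f"Dedicated is cheaper above ~{threshold:.0f} tokens/second sustained")
```

Below the break-even throughput, idle GPU time dominates and per-token billing is likely cheaper. Substitute current figures from the pricing page before drawing conclusions.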

Need higher GPU quotas or want to reserve capacity? Contact us.

## Creating & querying deployments

Create a deployment:

```bash
# Returns the deployment name, accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>; save it for querying
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --wait
```

Deployment placement (--region) must be set at creation time and cannot be changed in place.

If you do not specify --region, the deployment is pinned to a single datacenter at creation time and will not be automatically migrated later.

For production workloads that need geographic availability or capacity failover, always set --region explicitly:

```bash
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region GLOBAL   # recommended default
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region US
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region EUROPE
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region APAC
```

### Check current placement

```bash
firectl deployment get <DEPLOYMENT_ID>
```

The deployment metadata shows where the deployment is currently allowed to schedule replicas (placement / region configuration).

### Change placement

There is no supported command to change region placement on an existing deployment. To change placement, recreate the deployment:

```bash
# 1. Create a replacement deployment in the correct region
firectl deployment create accounts/fireworks/models/<MODEL_NAME> \
  --deployment-shape <shape> \
  --region GLOBAL \
  --min-replica-count 1

# 2. Verify it's healthy, then point your app at the new endpoint

# 3. Delete the old deployment
firectl deployment delete <OLD_DEPLOYMENT_ID>
```
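
If you script the recreate flow, one option is to build the `firectl` argument lists in code and review them before running anything. The sketch below uses only the flags shown above. The helper names and the `old-deployment-id` value are hypothetical, and it performs a dry run that prints the commands rather than executing them.

```python
import shlex


def create_command(model: str, shape: str, region: str, min_replicas: int = 1) -> list[str]:
    """Build the argv for creating the replacement deployment."""
    return [
        "firectl", "deployment", "create", model,
        "--deployment-shape", shape,
        "--region", region,
        "--min-replica-count", str(min_replicas),
    ]


def delete_command(deployment_id: str) -> list[str]:
    """Build the argv for deleting the old deployment."""
    return ["firectl", "deployment", "delete", deployment_id]


# Dry run: print each command instead of executing it.
steps = [
    create_command("accounts/fireworks/models/deepseek-v3", "throughput", "GLOBAL"),
    delete_command("old-deployment-id"),  # hypothetical ID
]
for argv in steps:
    print(shlex.join(argv))
```

To execute a step, pass its list to `subprocess.run(argv, check=True)`. Run the delete step only after confirming the replacement is healthy and your application points at it.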

See Regions for mega-regions and hardware availability.

See Deployment shapes below to optimize for speed, throughput, or cost.

Query your deployment:

After creating a deployment, query it by passing the deployment name as the `model` parameter. The name has this format:

```
accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
```

You can look up the deployment name at any time with `firectl deployment list` or `firectl deployment get <DEPLOYMENT_ID>`.

Example:

```
accounts/alice/deployments/12345678
```
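
If your application stores the account and deployment IDs separately, a small helper can assemble and validate the name. This is a minimal sketch: the function names are hypothetical, and the pattern only checks the `accounts/.../deployments/...` layout shown above, not which characters Fireworks allows in an ID.

```python
import re

# Matches the accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID> layout.
_NAME_RE = re.compile(r"^accounts/(?P<account>[^/]+)/deployments/(?P<deployment>[^/]+)$")


def deployment_name(account_id: str, deployment_id: str) -> str:
    """Assemble the deployment name used as the `model` parameter."""
    return f"accounts/{account_id}/deployments/{deployment_id}"


def parse_deployment_name(name: str) -> tuple[str, str]:
    """Split a deployment name into (account_id, deployment_id)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a deployment name: {name!r}")
    return match["account"], match["deployment"]


name = deployment_name("alice", "12345678")
print(name)
print(parse_deployment_name(name))
```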

### Code examples

**Python (Fireworks SDK):**

```python
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)

print(response.choices[0].message.content)
```

**Python (OpenAI SDK):**

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)

print(response.choices[0].message.content)
```

**JavaScript:**

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

const response = await client.chat.completions.create({
  model: "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
});

console.log(response.choices[0].message.content);
```

**curl:**

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ]
  }'
```

### Deployment status states

Deployment states from the Gateway API spec:

* `CREATING` - still being created
* `READY` - ready to be used
* `UPDATING` - an update is in progress
* `DELETING` - being deleted
* `DELETED` - soft-deleted
* `FAILED` - creation failed (see status for details)
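
When you cannot rely on `--wait` (for example, in a separate monitoring script), you might poll the state yourself. The sketch below assumes you supply a `get_state` callable that returns one of the state strings above; how you obtain that string is up to you and is not shown. Treating `DELETING` and `DELETED` as errors is a design choice for this example, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable

WAITING = {"CREATING", "UPDATING"}
TERMINAL_ERRORS = {"FAILED", "DELETING", "DELETED"}


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 1800.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until get_state() returns READY, or raise on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "READY":
            return
        if state in TERMINAL_ERRORS:
            raise RuntimeError(f"deployment will not become ready (state={state})")
        if state not in WAITING:
            raise RuntimeError(f"unrecognized state: {state!r}")
        if clock() >= deadline:
            raise TimeoutError(f"deployment still {state} after {timeout_s} seconds")
        sleep(poll_interval_s)


# Demo with a fake state source and no real sleeping.
states = iter(["CREATING", "CREATING", "READY"])
wait_until_ready(lambda: next(states), sleep=lambda _: None)
print("deployment is ready")
```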

UI-only states are display labels derived from deployment fields:

* `Inactive`: `state == READY && max_replica_count == 0 && ready_replica_count == 0`
* `Scaled to 0`: `state == READY && min_replica_count == 0 && max_replica_count > 0 && desired_replica_count == 0 && ready_replica_count == 0`

These labels are not additional backend `Deployment.State` enum values.
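
The two label conditions above translate directly into code. The following is a minimal sketch: the `DeploymentFields` container is hypothetical, the field names mirror those in the conditions, and falling back to the raw backend state for every other case is an assumption about what the UI shows.

```python
from dataclasses import dataclass


@dataclass
class DeploymentFields:
    """Hypothetical container for the fields used in the label conditions."""
    state: str
    min_replica_count: int
    max_replica_count: int
    desired_replica_count: int
    ready_replica_count: int


def display_label(d: DeploymentFields) -> str:
    """Return the UI-only label if one applies, else the backend state."""
    if d.state == "READY" and d.ready_replica_count == 0:
        if d.max_replica_count == 0:
            return "Inactive"
        if (
            d.min_replica_count == 0
            and d.max_replica_count > 0
            and d.desired_replica_count == 0
        ):
            return "Scaled to 0"
    return d.state  # assumption: other cases show the backend state


print(display_label(DeploymentFields("READY", 0, 0, 0, 0)))
print(display_label(DeploymentFields("READY", 0, 2, 0, 0)))
print(display_label(DeploymentFields("READY", 1, 2, 1, 1)))
```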

## Deployment shapes

Deployment shapes are the primary way to configure deployments. They're pre-configured templates optimized for speed, cost, or efficiency, including hardware, quantization, and other [performance factors](/faq/deployment/performance/optimization#performance-factors).

* **Fast**: Low latency for interactive workloads
* **Throughput**: Low cost per token at scale for high-volume workloads
* **Minimal**: Lowest cost for testing or light workloads

**Usage:**

```bash
# List available shapes
firectl deployment-shape-version list --base-model <model-id>

# Create with a shape (shorthand)
firectl deployment create accounts/fireworks/models/deepseek-v3 --deployment-shape throughput

# Create with full shape ID
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast

# View shape details
firectl deployment-shape-version get <full-deployment-shape-version-id>
```

Need even better performance with tailored optimizations? Contact our team.

## Managing & configuring deployments

### Basic management

```bash
# List all deployments
firectl deployment list

# Check deployment status
firectl deployment get <DEPLOYMENT_ID>

# Delete a deployment
firectl deployment delete <DEPLOYMENT_ID>
```

By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.

When a deployment is scaled to zero, requests return a 503 error immediately while the deployment scales up. Your application should implement retry logic to handle this. See Scaling from zero behavior for implementation details.
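
One way to handle the 503 responses is to retry with exponential backoff and jitter. The sketch below is not an official client feature. It assumes you wrap your request in a `send` callable that returns a `(status_code, body)` pair, and the attempt count and delays are placeholder values that you should tune to how long your deployment takes to scale up.

```python
import random
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 8,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-503 status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body  # the caller handles every other status
        if attempt < max_attempts:
            # Exponential backoff, capped, plus jitter to avoid synchronized retries.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise TimeoutError(f"still receiving 503 after {max_attempts} attempts")


# Demo with a fake sender: two 503 responses, then success.
responses = iter([(503, ""), (503, ""), (200, "ok")])
result = call_with_retry(lambda: next(responses), sleep=lambda _: None)
print(result)
```

The Scaling from zero behavior page remains the authoritative reference for recommended retry settings.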

### GPU hardware

Choose GPU type with --accelerator-type:

  • NVIDIA_A100_80GB
  • NVIDIA_H100_80GB
  • NVIDIA_H200_141GB

GPU availability varies by region. See the Hardware selection guide.

### Autoscaling

Control replica counts, scale timing, and load targets for your deployment.

See the Autoscaling guide for configuration options.

### Multiple GPUs per replica

Use multiple GPUs to improve latency and throughput:

```bash
firectl deployment create <MODEL_NAME> --accelerator-count 2
```

More GPUs per replica generally means faster generation, but scaling is sub-linear: doubling the GPU count does not double performance.

### Advanced

## Next steps

  • Configure autoscaling for optimal cost and performance
  • Deploy your own models from Hugging Face
  • Reduce costs with model quantization
  • Choose deployment regions for optimal latency
  • Purchase reserved GPUs for guaranteed capacity
  • Fine-tune models for your specific use case

Source: Fireworks AI Docs
Link last verified: 2026-06-07