# Deployments Quickstart

Summary: Deploy models on dedicated GPUs in minutes


On-demand deployments are dedicated GPUs that give you better performance, no rate limits, fast autoscaling, and a wider selection of models than serverless. This quickstart will help you spin up your first on-demand deployment in minutes.

## Step 1: Create and export an API key

Before you begin, create an API key in the Fireworks dashboard. Click Create API key and store it in a safe location.

Once you have your API key, export it as an environment variable in your terminal:

macOS / Linux:

```bash
export FIREWORKS_API_KEY="your_api_key_here"
```

Windows (`setx` saves the variable for new terminal sessions only, so open a new terminal afterward):

```powershell
setx FIREWORKS_API_KEY "your_api_key_here"
```
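Scripts that call the API can check for the key up front instead of failing on the first request. A minimal sketch in Python; `require_api_key` is a hypothetical helper, not part of any Fireworks SDK:

```python
import os


def require_api_key(env=None):
    """Return the Fireworks API key, or raise if it is missing or blank.

    Hypothetical helper: `env` defaults to the current process environment,
    but any mapping can be passed (useful for testing).
    """
    env = os.environ if env is None else env
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set. Export it (macOS/Linux), or run setx "
            "and open a new terminal (Windows)."
        )
    return key
```

Call `require_api_key()` at the top of a script; passing a dict as `env` lets you exercise it without touching your real environment.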

## Step 2: Install the CLI

To create and manage on-demand deployments, you'll need the `firectl` CLI tool. Install it using one of the following methods, based on your platform:


macOS (Homebrew):

```bash
brew tap fw-ai/firectl
brew install firectl

# If you encounter a failed SHA256 check, run this first and then retry the install:
brew update
```

macOS (Apple Silicon):

```bash
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-arm64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```

macOS (Intel):

```bash
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-amd64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```

Linux (x86-64):

```bash
wget -O firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip firectl.gz
sudo install -o root -g root -m 0755 firectl /usr/local/bin/firectl
```

Windows:

```bash
wget -L https://storage.googleapis.com/fireworks-public/firectl/stable/firectl.exe
```
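If you script the install, you can pick the matching binary from the OS and CPU architecture. A sketch, assuming the four builds listed above are the only ones available and that the Windows `.exe` targets x86-64; `firectl_download_url` is a hypothetical helper:

```python
import platform

BASE_URL = "https://storage.googleapis.com/fireworks-public/firectl/stable"

# (OS, CPU architecture) -> artifact name, taken from the commands above.
# Assumption: the Windows firectl.exe build is x86-64 only.
ARTIFACTS = {
    ("darwin", "arm64"): "darwin-arm64.gz",
    ("darwin", "amd64"): "darwin-amd64.gz",
    ("linux", "amd64"): "linux-amd64.gz",
    ("windows", "amd64"): "firectl.exe",
}

# platform.machine() spellings mapped to the architecture names used in the URLs.
ARCH_ALIASES = {"x86_64": "amd64", "amd64": "amd64", "arm64": "arm64", "aarch64": "arm64"}


def firectl_download_url(system=None, machine=None):
    """Return the download URL for this machine, or raise if no build is listed."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = ARCH_ALIASES.get(machine)
    artifact = ARTIFACTS.get((system, arch))
    if artifact is None:
        raise ValueError(f"No firectl build listed for {system}/{machine}")
    return f"{BASE_URL}/{artifact}"
```

Called with no arguments, it inspects the current machine via the standard `platform` module.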

Then, sign in:

```bash
firectl signin
```

## Step 3: Create a deployment

The following command creates a deployment of GPT OSS 120B optimized for speed. It takes a few minutes to complete. The resulting deployment scales between 0 and 1 replica (`--min-replica-count 0`, `--max-replica-count 1`).

```bash
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
    --deployment-shape fast \
    --scale-down-window 5m \
    --scale-up-window 30s \
    --min-replica-count 0 \
    --max-replica-count 1 \
    --scale-to-zero-window 5m \
    --wait
```

`fast` is a deployment shape: a pre-configured deployment template, created by the Fireworks team, that sets sensible defaults for most deployment options (such as hardware type).

You can also pass `throughput` or `cost` to `--deployment-shape`:

- `throughput` creates a deployment that trades off latency for lower cost-per-token at scale.
- `cost` creates a deployment that trades off latency and throughput for the lowest cost-per-token at small scale, usually for early experimentation and prototyping.

We recommend using a deployment shape, but you can also pass your own deployment configuration instead; see the deployment guide for the available options.
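To create deployments from a script, you can assemble the same command and run it with `subprocess`. A sketch that uses only the flags shown in this quickstart; `deployment_create_args` and `create_deployment` are hypothetical helper names, and the replica-bound check is a local sanity check rather than a documented firectl rule:

```python
import shlex
import subprocess

# Only the shapes named in this quickstart.
SHAPES = {"fast", "throughput", "cost"}


def deployment_create_args(model, shape="fast", min_replicas=0, max_replicas=1,
                           scale_up="30s", scale_down="5m", scale_to_zero="5m",
                           wait=True):
    """Build the argv for `firectl deployment create` (defaults mirror Step 3)."""
    if shape not in SHAPES:
        raise ValueError(f"unknown deployment shape: {shape!r}")
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("expected 0 <= min_replicas <= max_replicas")
    args = [
        "firectl", "deployment", "create", model,
        "--deployment-shape", shape,
        "--scale-down-window", scale_down,
        "--scale-up-window", scale_up,
        "--min-replica-count", str(min_replicas),
        "--max-replica-count", str(max_replicas),
        "--scale-to-zero-window", scale_to_zero,
    ]
    if wait:
        args.append("--wait")
    return args


def create_deployment(**kwargs):
    """Run the command; needs firectl on PATH and a prior `firectl signin`."""
    args = deployment_create_args(**kwargs)
    print("Running:", shlex.join(args))
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `create_deployment(model="accounts/fireworks/models/gpt-oss-120b", shape="throughput")` runs the Step 3 command with the `throughput` shape and returns firectl's output as text.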

The response will look like this:

```
Name: accounts/<YOUR ACCOUNT ID>/deployments/<DEPLOYMENT ID>
Create Time: <CREATION_TIME>
Expire Time: <EXPIRATION_TIME>
Created By: <YOUR EMAIL>
State: CREATING
Status: OK
Min Replica Count: 0
Max Replica Count: 1
Desired Replica Count: 0
Replica Count: 0
Autoscaling Policy:
  Scale Up Window: 30s
  Scale Down Window: 5m0s
  Scale To Zero Window: 5m0s
Base Model: accounts/fireworks/models/gpt-oss-120b
...other fields...
```

Take note of the `Name:` field in the response; you will use it in the next step to query your deployment.
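If you automate this step, you can pull the `Name:` value out of the command's output. A sketch that assumes the `Key: Value` layout of the sample response above (firectl's text output is not documented here as a stable format); both helper names are hypothetical:

```python
import re


def parse_firectl_fields(output):
    """Parse top-level `Key: Value` lines into a dict.

    Indented lines (e.g. under `Autoscaling Policy:`) and lines without a
    value are skipped. The split happens at the first colon, so values such
    as timestamps that contain colons are kept intact.
    """
    fields = {}
    for line in output.splitlines():
        if not line or line[0].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


DEPLOYMENT_NAME = re.compile(
    r"^accounts/(?P<account>[^/]+)/deployments/(?P<deployment>[^/]+)$"
)


def split_deployment_name(name):
    """Split `accounts/<account>/deployments/<id>` into (account, id)."""
    match = DEPLOYMENT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a deployment name: {name!r}")
    return match.group("account"), match.group("deployment")
```

`split_deployment_name` returns the parts that correspond to `<YOUR ACCOUNT ID>` and `<DEPLOYMENT ID>` in the sample response; the full `Name` string is what replaces `<DEPLOYMENT_NAME>` in Step 4.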

Learn more about deployment options→

Learn more about autoscaling options→

## Step 4: Query your deployment

You can now query your on-demand deployment with the same API used for serverless models. Replace `<DEPLOYMENT_NAME>` in the snippets below with the value of the `Name:` field from the previous step:

Install the Fireworks Python SDK:

The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version.

pip:

```bash
pip install --pre fireworks-ai
```

poetry:

```bash
poetry add --pre fireworks-ai
```

uv:

```bash
uv add --pre fireworks-ai
```

Then make your first on-demand API call:

```python
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
)

print(response.choices[0].message.content)
```

Python (OpenAI SDK):

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1",
)

response = client.chat.completions.create(
    model="<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
)

print(response.choices[0].message.content)
```

JavaScript:

```javascript
// `import` and top-level `await` require an ES module
// (e.g. a .mjs file, or "type": "module" in package.json).
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

const response = await client.chat.completions.create({
  model: "<DEPLOYMENT_NAME>",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
});

console.log(response.choices[0].message.content);
```

curl:

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "<DEPLOYMENT_NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ]
  }'
```

The examples from the Serverless quickstart also work with this deployment; just replace the model string with the deployment-specific model string from above.
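The Fireworks SDK example above addresses the deployment as the base model, a `#`, and the deployment name. A sketch of building that string; `deployment_model_string` is a hypothetical helper, and the `accounts/` prefix check simply mirrors the names shown in this quickstart:

```python
def deployment_model_string(base_model, deployment_name):
    """Join base model and deployment name with '#', as in the Fireworks SDK example."""
    prefix = "accounts/"
    for label, value in (("base_model", base_model), ("deployment_name", deployment_name)):
        # Both values in this quickstart are resource names under accounts/.
        if not value.startswith(prefix):
            raise ValueError(f"{label} should start with {prefix!r}: {value!r}")
        if "#" in value:
            raise ValueError(f"{label} must not already contain '#': {value!r}")
    return f"{base_model}#{deployment_name}"
```

Pass the result as `model=` in the Fireworks SDK example.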

[Serverless quickstart→](/getting-started/quickstart)

## Common use cases

### Autoscale based on requests per second

```bash
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
      --deployment-shape fast \
      --scale-down-window 5m \
      --scale-up-window 30s \
      --scale-to-zero-window 5m \
      --min-replica-count 0 \
      --max-replica-count 4 \
      --load-targets requests_per_second=5 \
      --wait
```

### Autoscale based on concurrent requests

```bash
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
      --deployment-shape fast \
      --scale-down-window 5m \
      --scale-up-window 30s \
      --scale-to-zero-window 5m \
      --min-replica-count 0 \
      --max-replica-count 4 \
      --load-targets concurrent_requests=5 \
      --wait
```
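To reason about `--load-targets`, it can help to model a target-tracking autoscaler: divide the observed load by the per-replica target, round up, and clamp to the replica bounds. This is a simplified illustration, assuming the target applies per replica; it is not Fireworks' documented algorithm and ignores the scale-up, scale-down, and scale-to-zero windows. `desired_replicas` is a hypothetical name:

```python
import math


def desired_replicas(observed_load, target_per_replica, min_replicas=0, max_replicas=4):
    """Simplified target-tracking estimate (an assumption, not Fireworks' exact policy).

    observed_load: total requests per second, or total concurrent requests.
    target_per_replica: the value passed via --load-targets, assumed per replica.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if observed_load <= 0:
        # No traffic: fall back to the floor (0 allows scale-to-zero).
        return min_replicas
    needed = math.ceil(observed_load / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

Under this model, `requests_per_second=5` with 12 requests per second of traffic gives `ceil(12 / 5) = 3` replicas, and anything above 20 requests per second is capped at `--max-replica-count 4`.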

## Next steps

Ready to scale to production, explore other modalities, or customize your models?

- Bring your own model and deploy it on Fireworks
- Improve model quality with supervised and reinforcement learning
- Use embeddings & reranking in search & context retrieval
- Run async inference jobs at scale, faster and cheaper
- Explore all available models across modalities
- Complete API documentation

Link last verified June 7, 2026.
Source: Fireworks AI Docs