Deployments
Configure and manage on-demand deployments on dedicated GPUs
New to deployments? Start with our Deployments Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
- Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
- No hard rate limits – Limited only by your deployment’s capacity
- Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
- Broader model selection – Access models not available on serverless
- Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
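To make the per-token versus per-GPU-second billing comparison concrete, here is a rough break-even sketch. The prices in it are hypothetical placeholders, not Fireworks pricing, and the function names are illustrative only; substitute the rates that apply to your account.

```python
def on_demand_cost_per_second(gpu_count: int, price_per_gpu_hour: float) -> float:
    """Cost of keeping a deployment running for one second (billed by GPU-second)."""
    return gpu_count * price_per_gpu_hour / 3600


def break_even_tokens_per_second(
    price_per_million_tokens: float,
    gpu_count: int,
    price_per_gpu_hour: float,
) -> float:
    """Sustained token rate above which on-demand is cheaper than per-token billing."""
    price_per_token = price_per_million_tokens / 1_000_000
    return on_demand_cost_per_second(gpu_count, price_per_gpu_hour) / price_per_token


if __name__ == "__main__":
    # Hypothetical prices, chosen only to keep the arithmetic easy to follow.
    rate = break_even_tokens_per_second(
        price_per_million_tokens=1.00,  # assumed serverless price (USD)
        gpu_count=1,
        price_per_gpu_hour=3.60,        # assumed on-demand price (USD)
    )
    print(f"Break-even at roughly {rate:.0f} tokens/second")
```

With these assumed numbers the break-even lands at about 1,000 tokens per second of sustained traffic; below that rate, per-token billing would be cheaper.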
Need higher GPU quotas or want to reserve capacity? Contact us.
Creating & querying deployments#
Create a deployment:
# This command returns your accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID> - save it for querying
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --wait
Deployment placement (--region) must be set at creation time and cannot be changed in place.
If you do not specify --region, the deployment is pinned to a single datacenter at creation time and will not be automatically migrated later.
For production workloads that need geographic availability or capacity failover, always set --region explicitly:
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region GLOBAL # recommended default
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region US
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region EUROPE
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --region APAC
Check current placement#
firectl deployment get <DEPLOYMENT_ID>
The deployment metadata shows where the deployment is currently allowed to schedule replicas (placement / region configuration).
Change placement#
There is no supported command to change region placement on an existing deployment. To change placement, recreate the deployment:
# 1. Create replacement with correct region
firectl deployment create accounts/fireworks/models/<MODEL_NAME> \
--deployment-shape <shape> \
--region GLOBAL \
--min-replica-count 1
# 2. Verify it's healthy, then point your app at the new endpoint
# 3. Delete old deployment
firectl deployment delete <OLD_DEPLOYMENT_ID>
See Regions for mega-regions and hardware availability.
See Deployment shapes below to optimize for speed, throughput, or cost.
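The recreate-to-change-placement steps above can be scripted. The sketch below only builds the firectl argument lists and prints them; it does not run anything. It uses only the flags shown on this page, and the region set is limited to the four values listed here (an assumption: other values may exist). The helper names are hypothetical.

```python
import shlex

# Region values that appear on this page; treat the set as an assumption.
KNOWN_REGIONS = ("GLOBAL", "US", "EUROPE", "APAC")


def create_command(model, region, deployment_shape=None, min_replica_count=None):
    """Build the argv for `firectl deployment create` with an explicit region."""
    if region not in KNOWN_REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {KNOWN_REGIONS}")
    cmd = ["firectl", "deployment", "create", model, "--region", region]
    if deployment_shape is not None:
        cmd += ["--deployment-shape", deployment_shape]
    if min_replica_count is not None:
        cmd += ["--min-replica-count", str(min_replica_count)]
    return cmd


def delete_command(deployment_id):
    """Build the argv for `firectl deployment delete`."""
    return ["firectl", "deployment", "delete", deployment_id]


if __name__ == "__main__":
    # Placeholder model path and deployment ID, for illustration only.
    create = create_command(
        "accounts/fireworks/models/my-model",
        region="GLOBAL",
        deployment_shape="throughput",
        min_replica_count=1,
    )
    print("1.", shlex.join(create))
    print("2. Verify the new deployment is healthy and repoint your app")
    print("3.", shlex.join(delete_command("old-deployment-id")))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)`, and delete the old deployment only after the replacement has been verified.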
Query your deployment:
After creating a deployment, query it using this format:
accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
You can find your deployment name anytime with firectl deployment list and firectl deployment get <DEPLOYMENT_ID>.
Example:
accounts/alice/deployments/12345678
Code examples#
```python
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)
print(response.choices[0].message.content)
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="Python (OpenAI SDK)"></span>
```python
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.chat.completions.create(
model="accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)
print(response.choices[0].message.content)
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="JavaScript"></span>
```javascript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
messages: [
{
role: "user",
content: "Explain quantum computing in simple terms",
},
],
});
console.log(response.choices[0].message.content);
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="curl"></span>
```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms"
}
]
}'
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
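The `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>` string used as the model in the examples above is easy to mistype. A small helper can build and validate it before a request is sent. This is a sketch: the helper names are hypothetical, and the pattern assumes only that neither ID contains a slash.

```python
import re

# Matches accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>; IDs may not contain "/".
_DEPLOYMENT_NAME = re.compile(
    r"^accounts/(?P<account_id>[^/]+)/deployments/(?P<deployment_id>[^/]+)$"
)


def deployment_name(account_id: str, deployment_id: str) -> str:
    """Build the model string for querying an on-demand deployment."""
    name = f"accounts/{account_id}/deployments/{deployment_id}"
    parse_deployment_name(name)  # raises if either ID is empty or contains "/"
    return name


def parse_deployment_name(name: str) -> tuple[str, str]:
    """Split a deployment name into (account_id, deployment_id)."""
    match = _DEPLOYMENT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a deployment name: {name!r}")
    return match["account_id"], match["deployment_id"]


if __name__ == "__main__":
    print(deployment_name("alice", "12345678"))
```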
### Deployment status states
Deployment states from the Gateway API spec:
* `CREATING` - still being created
* `READY` - ready to be used
* `UPDATING` - an update is in progress
* `DELETING` - being deleted
* `DELETED` - soft-deleted
* `FAILED` - creation failed (see status for details)
UI-only states are display labels derived from deployment fields:
* `Inactive`: `state == READY && max_replica_count == 0 && ready_replica_count == 0`
* `Scaled to 0`: `state == READY && min_replica_count == 0 && max_replica_count > 0 && desired_replica_count == 0 && ready_replica_count == 0`
These labels are not backend `Deployment.State` enum values.
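The two UI-only labels can be reproduced client-side from the deployment fields. The sketch below applies the conditions listed above; the `DeploymentView` class is a hypothetical container, and falling back to the raw state when neither label applies is an assumption about how you might want to display it.

```python
from dataclasses import dataclass


@dataclass
class DeploymentView:
    """Hypothetical container for the fields the labels are derived from."""
    state: str
    min_replica_count: int
    max_replica_count: int
    desired_replica_count: int
    ready_replica_count: int


def display_label(d: DeploymentView) -> str:
    """Return the UI label for a deployment, or its raw state if no label applies."""
    if d.state == "READY" and d.ready_replica_count == 0:
        if d.max_replica_count == 0:
            return "Inactive"
        if d.min_replica_count == 0 and d.desired_replica_count == 0:
            # max_replica_count > 0 is implied by the check above.
            return "Scaled to 0"
    return d.state  # assumption: show the backend state otherwise


if __name__ == "__main__":
    print(display_label(DeploymentView("READY", 0, 2, 0, 0)))
```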
## Deployment shapes
Deployment shapes are the primary way to configure deployments. They're pre-configured templates optimized for speed, cost, or efficiency, including hardware, quantization, and other [performance factors](/faq/deployment/performance/optimization#performance-factors).
* **Fast** – Low latency for interactive workloads
* **Throughput** – Optimized cost per token at scale for high-volume workloads
* **Minimal** – Lowest cost for testing or light workloads
**Usage:**
```bash
# List available shapes
firectl deployment-shape-version list --base-model <model-id>
# Create with a shape (shorthand)
firectl deployment create accounts/fireworks/models/deepseek-v3 --deployment-shape throughput
# Create with full shape ID
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast
# View shape details
firectl deployment-shape-version get <full-deployment-shape-version-id>
```
Need even better performance with tailored optimizations? Contact our team.
Managing & configuring deployments#
Basic management#
# List all deployments
firectl deployment list
# Check deployment status
firectl deployment get <DEPLOYMENT_ID>
# Delete a deployment
firectl deployment delete <DEPLOYMENT_ID>
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
When a deployment is scaled to zero, requests return a 503 error immediately while the deployment scales up. Your application should implement retry logic to handle this. See Scaling from zero behavior for implementation details.
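One way to handle the 503 returned during scale-up is to retry with exponential backoff. The sketch below is a generic illustration, not the documented Fireworks implementation: `send` is a hypothetical callable that performs one request and returns `(status_code, body)`, and the attempt count and delays are arbitrary defaults you would tune to your deployment's cold-start time.

```python
import random
import time


def call_with_retry(send, max_attempts=8, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `send()` until it returns a non-503 status, backing off between tries.

    `send` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body
        if attempt == max_attempts:
            break
        # Exponential backoff capped at max_delay, plus up to 25% random jitter.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 4))
    raise TimeoutError(f"still receiving 503 after {max_attempts} attempts")
```

In practice `send` would wrap one of the query examples on this page and return the HTTP status alongside the parsed response. The `sleep` parameter exists so tests can substitute a no-op.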
GPU hardware#
Choose GPU type with --accelerator-type:
- NVIDIA_A100_80GB
- NVIDIA_H100_80GB
- NVIDIA_H200_141GB
GPU availability varies by region. See the Hardware selection guide.
Autoscaling#
Control replica counts, scale timing, and load targets for your deployment.
See the Autoscaling guide for configuration options.
Multiple GPUs per replica#
Use multiple GPUs to improve latency and throughput:
firectl deployment create <MODEL_NAME> --accelerator-count 2
More GPUs = faster generation. Note that scaling is sub-linear (2x GPUs ≠ 2x performance).
Advanced#
- Speculative decoding - Speed up text generation using draft models or n-gram speculation
- Quantization - Reduce model precision (e.g., FP16 to FP8) to improve speed and reduce costs by 30-50%
- Performance benchmarking - Measure and optimize your deployment’s performance with load testing
- Managing default deployments - Control which deployment handles queries when using just the model name
- Publishing deployments - Make your deployment accessible to other Fireworks users
Next steps#
Configure autoscaling for optimal cost and performance
Deploy your own models from Hugging Face
Reduce costs with model quantization
Choose deployment regions for optimal latency
Purchase reserved GPUs for guaranteed capacity
Fine-tune models for your specific use case