Autoscaling ↗
noSummary: Configure how your deployment scales based on traffic
Original Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
Configure how your deployment scales based on traffic
Control how your deployment scales based on traffic and load.
Configuration options#
| Flag | Type | Default | Description |
|---|---|---|---|
--min-replica-count | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero |
--max-replica-count | Integer | 1 | Maximum number of replicas |
--scale-up-window | Duration | 30s | Wait time before scaling up |
--scale-down-window | Duration | 10m | Wait time before scaling down |
--scale-to-zero-window | Duration | 1h | Idle time before scaling to zero (min: 5m) |
--load-targets | Key-value | default=0.8 | Scaling thresholds. See options below |
Load target options (use as --load-targets <key>=<value>[,<key>=<value>...]):
default=<Fraction>- General load target from 0 to 1tokens_generated_per_second=<Integer>- Desired tokens per second per replicaprompt_tokens_per_second=<Integer>- Desired prompt tokens per second per replicarequests_per_second=<Number>- Desired requests per second per replicaconcurrent_requests=<Number>- Desired concurrent requests per replica
When multiple targets are specified, the maximum replica count across all is used.
Common patterns#
Scale to zero when idle to minimize costs:
firectl deployment create <MODEL_NAME> \
--min-replica-count 0 \
--max-replica-count 3 \
--scale-to-zero-window 1h
```
Best for: Development, testing, or intermittent production workloads.
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="Performance-focused"></span>
Keep replicas running for instant response:
```bash
firectl deployment create <MODEL_NAME> \
--min-replica-count 2 \
--max-replica-count 10 \
--scale-up-window 15s \
--load-targets concurrent_requests=5
```
Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="Predictable traffic"></span>
Match known traffic patterns:
```bash
firectl deployment create <MODEL_NAME> \
--min-replica-count 3 \
--max-replica-count 5 \
--scale-down-window 30m \
--load-targets tokens_generated_per_second=150
```
Best for: Steady workloads where you know typical load ranges.
<span class="tab-end"></span>
<span class="tab-group-end"></span>
## Scaling from zero behavior
When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process:
```json
{
"error": {
"message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
"code": "DEPLOYMENT_SCALING_UP",
"type": "error"
}
}
Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.
Handling scale-from-zero responses#
Implement retry logic with exponential backoff to gracefully handle scale-up delays:
import time
import requests
def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5):
"""Query a deployment with retry logic for scale-from-zero scenarios."""
delay = initial_delay
for attempt in range(max_retries):
response = requests.post(url, json=payload, headers=headers)
# Only retry if deployment is scaling up
if response.status_code == 503:
error_code = response.json().get("error", {}).get("code")
if error_code == "DEPLOYMENT_SCALING_UP":
print(f"Deployment scaling up, retrying in {delay}s...")
time.sleep(delay)
delay = min(delay * 1.5, 60) # Cap at 60 seconds
continue
response.raise_for_status()
return response.json()
raise Exception("Deployment did not scale up in time")
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="JavaScript"></span>
```javascript
async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) {
let delay = initialDelay;
for (let attempt = 0; attempt < maxRetries; attempt++) {
const response = await fetch(url, {
method: 'POST',
headers: { 'Content-Type': 'application/json', ...headers },
body: JSON.stringify(payload)
});
// Only retry if deployment is scaling up
if (response.status === 503) {
const body = await response.json();
if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
console.log(`Deployment scaling up, retrying in ${delay/1000}s...`);
await new Promise(resolve => setTimeout(resolve, delay));
delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds
continue;
}
}
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return response.json();
}
throw new Error('Deployment did not scale up in time');
}
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="curl"></span>
```bash
# Simple retry loop for scale-from-zero
MAX_RETRIES=30
RETRY_DELAY=5
for i in $(seq 1 $MAX_RETRIES); do
response=$(curl -s -w "\n%{http_code}" \
https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>", ...}')
http_code=$(echo "$response" | tail -n1)
body=$(echo "$response" | head -n -1)
# Only retry if deployment is scaling up
if [ "$http_code" -eq 503 ]; then
error_code=$(echo "$body" | jq -r '.error.code // empty')
if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
sleep $RETRY_DELAY
RETRY_DELAY=$((RETRY_DELAY * 2))
continue
fi
echo "$body"
exit 1
fi
# Check for success (2xx status codes)
if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
echo "$body"
exit 0
fi
echo "$body"
exit 1
done
echo "Deployment did not scale up in time"
exit 1
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
<span class="callout-start" data-callout-type="tip"></span>
Cold start times vary depending on model size—larger models may take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running.
<span class="callout-end"></span>
<span class="callout-start" data-callout-type="note"></span>
Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.
<span class="callout-end"></span>Link last verified
June 7, 2026.
View original ↗
Source: Fireworks AI Docs
Link last verified: 2026-06-07