Autoscaling ↗

fireworks guide intermediate configuration deployment

Summary: Configure how your deployment scales based on traffic

Original Documentation

Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.

Configure how your deployment scales based on traffic

Control how your deployment scales based on traffic and load.

Configuration options#

Flag	Type	Default	Description
`--min-replica-count`	Integer	0	Minimum number of replicas. Set to 0 for scale-to-zero
`--max-replica-count`	Integer	1	Maximum number of replicas
`--scale-up-window`	Duration	30s	Wait time before scaling up
`--scale-down-window`	Duration	10m	Wait time before scaling down
`--scale-to-zero-window`	Duration	1h	Idle time before scaling to zero (min: 5m)
`--load-targets`	Key-value	`default=0.8`	Scaling thresholds. See options below

Load target options (use as --load-targets <key>=<value>[,<key>=<value>...]):

default=<Fraction> - General load target from 0 to 1
tokens_generated_per_second=<Integer> - Desired tokens per second per replica
prompt_tokens_per_second=<Integer> - Desired prompt tokens per second per replica
requests_per_second=<Number> - Desired requests per second per replica
concurrent_requests=<Number> - Desired concurrent requests per replica

When multiple targets are specified, the maximum replica count across all is used.

Common patterns#

Scale to zero when idle to minimize costs:

    firectl deployment create <MODEL_NAME> \
      --min-replica-count 0 \
      --max-replica-count 3 \
      --scale-to-zero-window 1h
    ```

Best for: Development, testing, or intermittent production workloads.
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Performance-focused"></span>
Keep replicas running for instant response:

```bash
    firectl deployment create <MODEL_NAME> \
      --min-replica-count 2 \
      --max-replica-count 10 \
      --scale-up-window 15s \
      --load-targets concurrent_requests=5
    ```

Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Predictable traffic"></span>
Match known traffic patterns:

```bash
    firectl deployment create <MODEL_NAME> \
      --min-replica-count 3 \
      --max-replica-count 5 \
      --scale-down-window 30m \
      --load-targets tokens_generated_per_second=150
    ```

Best for: Steady workloads where you know typical load ranges.
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

## Scaling from zero behavior

When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process:

```json
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}

Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.

Handling scale-from-zero responses#

Implement retry logic with exponential backoff to gracefully handle scale-up delays:

    import time
    import requests

    def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5):
        """Query a deployment with retry logic for scale-from-zero scenarios."""
        delay = initial_delay
        
        for attempt in range(max_retries):
            response = requests.post(url, json=payload, headers=headers)
            
            # Only retry if deployment is scaling up
            if response.status_code == 503:
                error_code = response.json().get("error", {}).get("code")
                if error_code == "DEPLOYMENT_SCALING_UP":
                    print(f"Deployment scaling up, retrying in {delay}s...")
                    time.sleep(delay)
                    delay = min(delay * 1.5, 60)  # Cap at 60 seconds
                    continue
                
            response.raise_for_status()
            return response.json()
        
        raise Exception("Deployment did not scale up in time")
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="JavaScript"></span>
```javascript
    async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) {
      let delay = initialDelay;
      
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        const response = await fetch(url, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json', ...headers },
          body: JSON.stringify(payload)
        });
        
        // Only retry if deployment is scaling up
        if (response.status === 503) {
          const body = await response.json();
          if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
            console.log(`Deployment scaling up, retrying in ${delay/1000}s...`);
            await new Promise(resolve => setTimeout(resolve, delay));
            delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds
            continue;
          }
        }
        
        if (!response.ok) throw new Error(`HTTP ${response.status}`);
        return response.json();
      }
      
      throw new Error('Deployment did not scale up in time');
    }
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="curl"></span>
```bash
    # Simple retry loop for scale-from-zero
    MAX_RETRIES=30
    RETRY_DELAY=5

    for i in $(seq 1 $MAX_RETRIES); do
      response=$(curl -s -w "\n%{http_code}" \
        https://api.fireworks.ai/inference/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $FIREWORKS_API_KEY" \
        -d '{"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>", ...}')
      
      http_code=$(echo "$response" | tail -n1)
      body=$(echo "$response" | head -n -1)
      
      # Only retry if deployment is scaling up
      if [ "$http_code" -eq 503 ]; then
        error_code=$(echo "$body" | jq -r '.error.code // empty')
        if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
          echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
          sleep $RETRY_DELAY
          RETRY_DELAY=$((RETRY_DELAY * 2))
          continue
        fi

        echo "$body"
        exit 1
      fi
      
      # Check for success (2xx status codes)
      if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
        echo "$body"
        exit 0
      fi

      echo "$body"
      exit 1
    done

    echo "Deployment did not scale up in time"
    exit 1
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="tip"></span>
  Cold start times vary depending on model size—larger models may take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running.
<span class="callout-end"></span>

<span class="callout-start" data-callout-type="note"></span>
  Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.
<span class="callout-end"></span>

Link last verified June 7, 2026. View original ↗

Source: Fireworks AI Docs

Link last verified: 2026-06-07