Autoscaling

no
Summary: Configure how your deployment scales based on traffic

Original Documentation

Documentation Index#

Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.

Configure how your deployment scales based on traffic

Control how your deployment scales based on traffic and load.

Configuration options#

FlagTypeDefaultDescription
--min-replica-countInteger0Minimum number of replicas. Set to 0 for scale-to-zero
--max-replica-countInteger1Maximum number of replicas
--scale-up-windowDuration30sWait time before scaling up
--scale-down-windowDuration10mWait time before scaling down
--scale-to-zero-windowDuration1hIdle time before scaling to zero (min: 5m)
--load-targetsKey-valuedefault=0.8Scaling thresholds. See options below

Load target options (use as --load-targets <key>=<value>[,<key>=<value>...]):

  • default=<Fraction> - General load target from 0 to 1
  • tokens_generated_per_second=<Integer> - Desired tokens per second per replica
  • prompt_tokens_per_second=<Integer> - Desired prompt tokens per second per replica
  • requests_per_second=<Number> - Desired requests per second per replica
  • concurrent_requests=<Number> - Desired concurrent requests per replica

When multiple targets are specified, the maximum replica count across all is used.

Common patterns#

Scale to zero when idle to minimize costs:

    firectl deployment create <MODEL_NAME> \
      --min-replica-count 0 \
      --max-replica-count 3 \
      --scale-to-zero-window 1h
    ```

Best for: Development, testing, or intermittent production workloads.
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Performance-focused"></span>
Keep replicas running for instant response:

```bash
    firectl deployment create <MODEL_NAME> \
      --min-replica-count 2 \
      --max-replica-count 10 \
      --scale-up-window 15s \
      --load-targets concurrent_requests=5
    ```

Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Predictable traffic"></span>
Match known traffic patterns:

```bash
    firectl deployment create <MODEL_NAME> \
      --min-replica-count 3 \
      --max-replica-count 5 \
      --scale-down-window 30m \
      --load-targets tokens_generated_per_second=150
    ```

Best for: Steady workloads where you know typical load ranges.
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

## Scaling from zero behavior

When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process:

```json
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}

Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.

Handling scale-from-zero responses#

Implement retry logic with exponential backoff to gracefully handle scale-up delays:

    import time
    import requests

    def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5):
        """Query a deployment with retry logic for scale-from-zero scenarios."""
        delay = initial_delay
        
        for attempt in range(max_retries):
            response = requests.post(url, json=payload, headers=headers)
            
            # Only retry if deployment is scaling up
            if response.status_code == 503:
                error_code = response.json().get("error", {}).get("code")
                if error_code == "DEPLOYMENT_SCALING_UP":
                    print(f"Deployment scaling up, retrying in {delay}s...")
                    time.sleep(delay)
                    delay = min(delay * 1.5, 60)  # Cap at 60 seconds
                    continue
                
            response.raise_for_status()
            return response.json()
        
        raise Exception("Deployment did not scale up in time")
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="JavaScript"></span>
```javascript
    async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) {
      let delay = initialDelay;
      
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        const response = await fetch(url, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json', ...headers },
          body: JSON.stringify(payload)
        });
        
        // Only retry if deployment is scaling up
        if (response.status === 503) {
          const body = await response.json();
          if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
            console.log(`Deployment scaling up, retrying in ${delay/1000}s...`);
            await new Promise(resolve => setTimeout(resolve, delay));
            delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds
            continue;
          }
        }
        
        if (!response.ok) throw new Error(`HTTP ${response.status}`);
        return response.json();
      }
      
      throw new Error('Deployment did not scale up in time');
    }
    ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="curl"></span>
```bash
    # Simple retry loop for scale-from-zero
    MAX_RETRIES=30
    RETRY_DELAY=5

    for i in $(seq 1 $MAX_RETRIES); do
      response=$(curl -s -w "\n%{http_code}" \
        https://api.fireworks.ai/inference/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $FIREWORKS_API_KEY" \
        -d '{"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>", ...}')
      
      http_code=$(echo "$response" | tail -n1)
      body=$(echo "$response" | head -n -1)
      
      # Only retry if deployment is scaling up
      if [ "$http_code" -eq 503 ]; then
        error_code=$(echo "$body" | jq -r '.error.code // empty')
        if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
          echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
          sleep $RETRY_DELAY
          RETRY_DELAY=$((RETRY_DELAY * 2))
          continue
        fi

        echo "$body"
        exit 1
      fi
      
      # Check for success (2xx status codes)
      if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
        echo "$body"
        exit 0
      fi

      echo "$body"
      exit 1
    done

    echo "Deployment did not scale up in time"
    exit 1
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="tip"></span>
  Cold start times vary depending on model sizelarger models may take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running.
<span class="callout-end"></span>

<span class="callout-start" data-callout-type="note"></span>
  Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.
<span class="callout-end"></span>
Link last verified June 7, 2026. View original ↗
Source: Fireworks AI Docs
Link last verified: 2026-06-07