Inference for RL Rollouts

Summary: Session affinity, weight-swap behavior, and MoE Router Replay for rollout traffic on Fireworks inference deployments.

When you use Fireworks inference to collect RL rollouts, the regular `/v1/completions` and `/v1/chat/completions` endpoints expose a few extra features tailored to multi-turn, stateful rollout traffic. You can use these whether or not the underlying deployment is a hot-load deployment.

These features are fully compatible with the OpenAI SDKs — they’re all attached as either request headers or optional body fields, so no SDK upgrade is required.

## Session affinity

Multi-turn rollouts typically reuse a long prefix between turns (same system prompt, same trajectory so far). To get the KV cache to hit, all turns of a trajectory should land on the same inference replica. Two headers are relevant here:

* `x-multi-turn-session-id`: identifies the agent trajectory. Set this once per trajectory and keep it constant across turns. If both headers are present, Fireworks currently prefers this value when deriving the request's session-affinity key.
* `x-session-affinity`: fallback sticky routing key when `x-multi-turn-session-id` is absent. In most RL rollout setups, set it to the same trajectory ID.

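<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="Python"></span>
```python
    from openai import OpenAI

    client = OpenAI(
        api_key="<FIREWORKS_API_KEY>",
        base_url="https://api.fireworks.ai/inference/v1",
    )

    # One ID per trajectory, reused unchanged on every turn.
    trajectory_id = "traj-42f1"

    # `trajectory` is your own iterable of turns; each turn carries the
    # messages accumulated so far.
    for turn in trajectory:
        response = client.chat.completions.create(
            model="accounts/<account_id>/models/<model_id>",
            messages=turn.messages,
            extra_headers={
                "x-multi-turn-session-id": trajectory_id,
                "x-session-affinity": trajectory_id,
                "fireworks-deployment": "accounts/<account_id>/deployments/<deployment_id>",
            },
        )
    ```
  <span class="tab-end"></span>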

  <span class="tab-start" data-tab-title="curl"></span>
```bash
    curl https://api.fireworks.ai/inference/v1/chat/completions \
      -H "Authorization: Bearer <fireworks_api_key>" \
      -H "fireworks-model: accounts/<account_id>/models/<model_id>" \
      -H "fireworks-deployment: accounts/<account_id>/deployments/<deployment_id>" \
      -H "x-multi-turn-session-id: traj-42f1" \
      -H "x-session-affinity: traj-42f1" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "accounts/<account_id>/models/<model_id>",
        "messages": [{"role": "user", "content": "..."}]
      }'
    ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="tip"></span>
  `x-session-affinity` on its own is already documented for general [prompt
  caching](/guides/prompt-caching#optimizing-inference-request-for-caching). In
  RL rollouts you typically also want `x-multi-turn-session-id` so that per-turn
  metrics (TTFT, generation latency) are aggregated by trajectory, while
  preserving the current serving preference when both headers are supplied.
<span class="callout-end"></span>

## Behavior during weight swap

If your rollout traffic hits a hot-load deployment, a new checkpoint can arrive mid-rollout. What happens to your requests depends on the deployment's configured transition mode:

* **Async transition (recommended for RL):** in-flight requests pause then resume on the same HTTP connection using the new weights. The active turn keeps its current KV state, so it continues rather than restarting. New requests queue up. You see elevated TTFT but no errors.
* **Synchronous transition:** in-flight requests finish on the old weights; new requests get HTTP `425 Too Early` until the swap is done. Your client should retry with back-off, ideally keeping the same session-affinity key so it lands on a replica that has already finished the swap.

`reset_prompt_cache` only affects what future requests or session IDs can reuse after the swap. See [Checkpoint-swap behavior](/fine-tuning/rl-rollout-debugging#checkpoint-swap-behavior) for the full semantics.
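
For the synchronous mode, a client-side retry loop might look like the following minimal sketch. `send_request` is a hypothetical zero-argument callable that issues one HTTP request, with the same session-affinity headers on every attempt, and returns `(status_code, body)`. The attempt count and delay values are illustrative assumptions, not Fireworks recommendations.

```python
import random
import time
from typing import Any, Callable, Tuple

HTTP_TOO_EARLY = 425  # returned while a synchronous weight swap is in progress


def call_with_backoff(
    send_request: Callable[[], Tuple[int, Any]],
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Retry while the server answers 425, using exponential back-off.

    `send_request` is a hypothetical callable; it should reuse the same
    session-affinity headers on every attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = HTTP_TOO_EARLY, None
    for attempt in range(max_attempts):
        status, body = send_request()
        if status != HTTP_TOO_EARLY:
            return status, body
        if attempt < max_attempts - 1:
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries from many concurrent rollout workers.
            sleep(delay + random.uniform(0, delay / 2))
    # Still 425 after the last attempt: hand the response back to the caller.
    return status, body
```

The injectable `sleep` argument exists only so the loop can be exercised in tests without waiting.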

## MoE Router Replay

For Mixture-of-Experts models, training-inference divergence often comes from the router picking different top-K experts at the same token position between trainer and inference. Aligning those choices across rollouts and training is known as [Rollout Router Replay (R3)](https://arxiv.org/abs/2510.11370).
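
To make that divergence concrete, here is a small diagnostic sketch that measures, per MoE layer, what fraction of the experts chosen at inference were also chosen by the trainer for the same token. It is not the R3 method itself. The function name and the list-of-lists input layout are assumptions made for illustration.

```python
from typing import List, Sequence


def expert_agreement(
    inference_experts: Sequence[Sequence[int]],
    trainer_experts: Sequence[Sequence[int]],
) -> List[float]:
    """Per-layer fraction of inference-selected experts the trainer also selected.

    Each argument holds one sequence of expert indices per MoE layer,
    all for the same token position.
    """
    if len(inference_experts) != len(trainer_experts):
        raise ValueError("both inputs must cover the same number of MoE layers")
    agreement = []
    for inf_layer, train_layer in zip(inference_experts, trainer_experts):
        inf_set, train_set = set(inf_layer), set(train_layer)
        if not inf_set:
            raise ValueError("each layer needs at least one selected expert")
        agreement.append(len(inf_set & train_set) / len(inf_set))
    return agreement


inference = [[1, 2, 3, 4], [0, 5, 6, 7]]
trainer = [[1, 2, 3, 9], [0, 5, 6, 7]]
print(expert_agreement(inference, trainer))  # [0.75, 1.0]
```

A value of 1.0 for every layer means the two routers agreed on that token. Lower values point at the layers where routing diverged.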

Fireworks inference supports returning the selected MoE experts for every token and every MoE layer. Pass `include_routing_matrix: true` together with `logprobs: true` on your request:

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer <fireworks_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/<account_id>/models/<model_id>",
    "messages": [{"role": "user", "content": "..."}],
    "include_routing_matrix": true,
    "logprobs": true
  }'
```

The selected expert indices for each token are returned alongside logprobs. For `/v1/chat/completions` you find them at `choices[i].logprobs.content[j].routing_matrix`; for `/v1/completions` the structure is analogous. Each value is a flattened, base64-encoded `uint8` array of shape `[num_layers_with_moe, num_active_experts]`.

### Example response (DeepSeek V3)

```json
{
  "object": "text_completion",
  "model": "...my-deepseek-v3-model...",
  "choices": [
    {
      "index": 0,
      "logprobs": {
        "content": [
          {
            "token": " ",
            "logprob": -0.00014507,
            "sampling_logprob": -0.0001450882,
            "token_id": 223,
            "routing_matrix": "CYvWPzaOl8g/o7q2XPVTMJ7w/Y8G..."
          }
        ]
      }
    }
  ]
}
```

### Decoding the routing matrix

DeepSeek V3 has 58 MoE layers (the first 3 of 61 total are dense) and selects 8 active experts per token, so each decoded buffer is 58 * 8 = 464 bytes.

```python
import base64
import numpy as np

num_layers_with_moe = 58
num_active_experts = 8

# `choice` is one entry of the parsed response's `choices` list.
encoded = choice["logprobs"]["content"][0]["routing_matrix"]
raw_bytes = base64.b64decode(encoded)
routing_matrix = np.frombuffer(raw_bytes, dtype=np.uint8).reshape(
    num_layers_with_moe, num_active_experts
)
# routing_matrix[layer_idx] -> array of 8 expert indices for that token
```
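
When decoding every token of a choice, it can help to check the buffer size before reshaping, so that a wrong layer or expert count fails loudly instead of producing a silently misaligned matrix. The following is a standard-library sketch under that assumption. The function names are hypothetical, and the layer-major (row-major) layout is inferred from the reshape used in the decoding step.

```python
import base64
from typing import Any, Dict, List


def decode_routing_matrix(
    encoded: str, num_layers_with_moe: int, num_active_experts: int
) -> List[List[int]]:
    """Decode one token's routing matrix into per-layer lists of expert indices."""
    raw = base64.b64decode(encoded)
    expected = num_layers_with_moe * num_active_experts
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    # Each byte is one uint8 expert index; rows are consecutive layers.
    return [
        list(raw[i * num_active_experts:(i + 1) * num_active_experts])
        for i in range(num_layers_with_moe)
    ]


def decode_choice(
    choice: Dict[str, Any], num_layers_with_moe: int, num_active_experts: int
) -> List[List[List[int]]]:
    """Decode every token in one chat-completions choice.

    The result is indexed as [token][layer][expert_slot].
    """
    return [
        decode_routing_matrix(
            entry["routing_matrix"], num_layers_with_moe, num_active_experts
        )
        for entry in choice["logprobs"]["content"]
    ]
```

For DeepSeek V3 you would call `decode_choice(choice, 58, 8)`, matching the 464-byte buffer size described in this section.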

### Other API modes

* Completions API (`/v1/completions`): same mechanism; `include_routing_matrix` and `logprobs` are top-level body fields.
* Streaming (`stream: true`): `routing_matrix` is included on each streamed token chunk's `logprobs.content` entry.
* Prompt tokens (`echo: true`): returns expert selection for the prompt tokens too. Combine with `echo_last: N` to only include expert selection for the last N prompt tokens.
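
The options above can be combined in one `/v1/completions` request body. The helper below is a hypothetical sketch that only assembles the JSON payload from the fields named in this section. It does not send anything, and the helper name and its argument defaults are assumptions.

```python
from typing import Any, Dict, Optional


def routing_request_body(
    model: str,
    prompt: str,
    echo: bool = False,
    echo_last: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Build a /v1/completions body that asks for per-token expert selection."""
    body: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "include_routing_matrix": True,
        "logprobs": True,
    }
    if echo_last is not None:
        if echo_last < 1:
            raise ValueError("echo_last must be a positive integer")
        # echo_last narrows echo, so echo is switched on with it.
        echo = True
        body["echo_last"] = echo_last
    if echo:
        body["echo"] = True
    if stream:
        body["stream"] = True
    return body
```

Serialize the result with `json.dumps` and POST it to the endpoint. With the OpenAI Python SDK, the non-standard fields can instead be passed through the `extra_body` argument of the `create` call.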

## Policy version in responses

On hot-load deployments, track which snapshot served each token—useful for off-policy RL and debugging stale rollouts.

### Streaming

Each streamed chunk includes the loaded snapshot in the `model` field as `accounts/<account_id>/models/<model_id>@<snapshot_identity>`:

```
data: {"object":"text_completion","model":"accounts/<account_id>/models/<model_id>@version_002","choices":[{"index":0,"text":"...","finish_reason":null}],...}
```

Parse the suffix after @ as the policy version for that token. If a weight swap happens mid-stream under async transition, later chunks may reflect the new snapshot.
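
A minimal sketch of that parsing step, assuming each chunk has already been parsed from its `data:` line into a dict. Both function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def split_policy_version(model_field: str) -> Tuple[str, Optional[str]]:
    """Split 'model@snapshot' into (model, snapshot); snapshot is None if absent."""
    base, sep, version = model_field.rpartition("@")
    if not sep:
        return model_field, None
    return base, version


def versions_in_stream(chunks: Iterable[Dict[str, Any]]) -> List[Optional[str]]:
    """Return the distinct policy versions seen in a stream, in order of first use."""
    seen: List[Optional[str]] = []
    for chunk in chunks:
        _, version = split_policy_version(chunk.get("model", ""))
        if version not in seen:
            seen.append(version)
    return seen
```

If `versions_in_stream` returns more than one entry for a single response, a weight swap landed mid-stream and the trajectory mixes tokens from different snapshots.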

### Non-streaming

Non-streaming responses are adopting the same `model@snapshot_identity` convention. Until your deployment shape exposes it, rely on streaming or correlate rollout timing with your hot-load poll timestamps.

Link last verified June 7, 2026.

Source: Fireworks AI Docs
