Serverless Overview ↗
noOriginal Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
How Serverless inference works on Fireworks: serving paths, billing, request/response headers, prompt caching, model lifecycle, and when to choose Serverless over On-demand
What is Serverless#
Serverless is multi-tenant inference for popular open models running on Fireworks-managed infrastructure. You point your client at api.fireworks.ai, send tokens, and pay only for what you use — no GPUs to size, no autoscaler to tune, no cold starts to wait through. Models eligible for Serverless carry the Serverless tag in the model library. To make your first call, see the Serverless quickstart.
Serverless products at a glance#
Three serving paths run on the same Serverless framework. They share the same rate-limit policy, but route and price differently:
- Standard — the default serving path. No
service_tierparameter needed. - Priority — higher reliability during peak periods. Opt in by setting
service_tier: "priority"on chat completions. Priced at a premium. - Fast — high-speed deployments for latency-sensitive workloads. Selected by switching the
modelID to a Fast variant (for example,accounts/fireworks/routers/kimi-k2p6-fast).
For usage examples and the full list of supported models, see Serverless Serving Paths. For pricing by serving path, see Serverless pricing.
Billing#
Serverless is priced per token. Three dimensions are billed:
- Input tokens — what you send to the model.
- Cached input tokens — input tokens served from prompt cache, discounted (default 50% of input on text and vision models, unless a model lists a different cached rate).
- Generated tokens — what the model produces.
Other things to know:
- The
usageobject in each response is the source of truth for what was billed (prompt_tokens,completion_tokens,total_tokens). - Batch inference is billed at 50% of standard Serverless rates on both input and output. See Batch inference.
- Your spend tier influences Serverless capacity caps in addition to your monthly budget — higher spend tiers unlock higher TPM upper bounds. See Account quotas and Serverless rate limits.
Request and response headers#
Headers a Serverless caller will set or read.
Request headers#
| Header | Notes |
|---|---|
Authorization: Bearer $FIREWORKS_API_KEY | Required for all requests. |
x-session-affinity | Optional sticky-routing key. Pin repeated requests to the same replica to maximize prompt-cache hit rate. See Prompt caching. |
Response headers#
Fireworks sets the following on Serverless inference responses:
| Header | What it tells you |
|---|---|
fireworks-prompt-tokens | Input tokens for the request. |
fireworks-cached-prompt-tokens | Cached portion of the input. See Prompt caching. |
X-Ratelimit-Limit-Tokens-Prompt | Your current Total Prompt Tokens per Second Limit. |
X-Ratelimit-Limit-Tokens-Cache-Adjusted-Prompt | Your current Total Uncached Prompt Tokens per Second Limit. |
X-Ratelimit-Limit-Tokens-Generated | Your current Total Generated Tokens per Second Limit. |
Streaming responses don’t carry per-request perf headers. To get the same metrics in the streaming response body, set the perf_metrics_in_response parameter on the request. See Querying text models.
Prompt caching#
Prompt caching is on by default for every Serverless model. Cached input tokens are billed at the discounted rate (default 50% of input). Caching is replica-local, so to maximize hit rate you should route repeated prompts to the same replica — pass a stable identifier in x-session-affinity (or in the OpenAI user field) for each user or session whose prompts share a prefix.
For the full guide, including how to structure prompts for cache hits and how to read cache metrics, see Prompt caching.
Serverless model lifecycle#
Serverless models are managed by the Fireworks team and may be updated or deprecated as new models are released. We provide at least 2 weeks advance notice before removing any model, with longer notice periods for popular models based on usage.
For production workloads requiring long-term model stability, we recommend on-demand deployments, which give you full control over model versions and updates.
Serverless vs On-demand#
| When Serverless fits | When On-demand fits |
|---|---|
| Pay per token, only for what you use | Pay per GPU-hour for dedicated capacity |
| You’re using popular base models that Fireworks already hosts | You’re running custom base models or fine-tuned LoRA models (LoRA requires On-demand) |
| You don’t want to manage scaling, replicas, or hardware sizing | You have custom latency requirements and want control over hardware and replicas |
For dedicated infrastructure, see On-demand deployments.
Next steps#
Make your first Serverless API call.
Higher-reliability and higher-speed serving paths.
Per-token rates for text, vision, embeddings, and Priority.
Adaptive TPM bounds and how the limit ramps with usage.
How caching works and how to maximize hit rate.
Dedicated GPUs for predictable throughput and custom models.