Prompt caching ↗
noOriginal Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
Prompt caching is a performance optimization feature that allows Fireworks to respond faster to requests with prompts that share common prefixes. In many situations, it can reduce time to first token (TTFT) by as much as 80%.
Prompt caching is enabled by default for all Fireworks models and deployments.
For serverless models, cached prompt tokens are discounted compared to regular prompt tokens. The default discount is 50%, but the exact discount varies by model. Check the Model Library for model-specific cached token pricing.
For dedicated deployments, prompt caching frees up resources, leading to higher throughput on the same hardware. Cached tokens on dedicated deployments are close to free for you, because they affect context length but do not need extra processing.
Using prompt caching#
Common use cases#
Requests to LLMs often share a large portion of their prompt. For example:
- Long system prompts with detailed instructions
- Descriptions of available tools for function calling
- Growing previous conversation history for chat use cases
- Shared per-user context, like a current file for a coding assistant
Prompt caching avoids re-processing the cached prefix of the prompt and starts output generation much sooner.
Structuring prompts for caching#
Prompt caching works only for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end.
For function calling models, tools are considered part of the prompt.
Optimizing inference request for caching#
Prompt caching only works within 1 replica. If you are using serverless or a deployment with multiple replicas, you need to give us hints for where to send the traffic to maximize prompt cache hit rates. To do this, you can send us a unique identifier for each user or session, when you expect the prompts to share the same prefix.
You may place this identifier either in the user field of the request body or
in the x-session-affinity header.
If you’re collecting RL rollouts and want per-trajectory metrics plus
stickiness across multi-turn agent interactions, also set
x-multi-turn-session-id. See Inference for RL
rollouts.
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/<MODEL_ID>",
messages=[{
"role": "user",
"content": "Explain quantum computing in simple terms"
}],
# A single client can make requests with different sessions.
extra_headers={
"x-session-affinity": "session-id-123"
}
)
print(response.choices[0].message.content)
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="JavaScript"></span>
```javascript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/<MODEL_ID>",
messages: [
{
role: "user",
content: "Explain quantum computing in simple terms",
},
],
// A single client can make requests with different sessions.
extra_headers: {
x-session-affinity: "session-id-123"
}
});
console.log(response.choices[0].message.content);
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="curl"></span>
```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-H "x-session-affinity: session-id-123" \
-d '{
"model": "accounts/fireworks/models/",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms"
}
]
}'
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
### Resetting or isolating cache state
When you want new traffic to stop reusing an older cached prefix, Fireworks supports a few options:
* Change the session key in `user` or `x-session-affinity`. This moves traffic onto a different sticky routing key.
* Set a new `x-prompt-cache-isolation-key` header or `prompt_cache_isolation_key` request field. This forces cache separation even when the prompt text is otherwise identical.
* For hot-load or weight-sync workflows, set `reset_prompt_cache` to `all`, `none`, or `new_session` when you publish a new snapshot.
## Prompt optimization techniques for maximum cache hits
Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the cache from that token onward. Here are key strategies to maximize your cache hit rates:
### Keep your prompt prefix stable
The most critical rule for effective prompt caching is maintaining a stable prefix. Any change to the beginning of your prompt will invalidate the entire cache chain that follows.
<span class="callout-start" data-callout-type="warning"></span>
**Common mistake:** Including timestamps or other dynamic content at the beginning of your system prompt.
```python
# ❌ DON'T: This kills cache hit rates
system_prompt = f"""
Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
You are a helpful AI assistant...
"""Even a one-second difference in the timestamp will invalidate the entire cache, making it completely ineffective.
Structure prompts for caching success#
✅ DO: Place static content first, dynamic content last
# ✅ Good: Static content first
system_prompt = """
You are a helpful AI assistant with expertise in software development.
Your guidelines:
- Provide clear, concise explanations
- Include practical examples when helpful
- Ask clarifying questions when requirements are unclear
Available tools:
- web_search: Search the internet for current information
- code_executor: Run code snippets safely
- file_manager: Read and write files
"""
# Build the complete prompt
user_message = ""
# Add dynamic content at the end
if user_context:
user_message += f"User context: {user_context}\n\n"
if current_time_needed:
user_message += f"Current time: {datetime.now().isoformat()}\n\n"
# User query goes last
user_message += user_query
response = client.chat.completions.create(
model="accounts/fireworks/models/<MODEL_ID>",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
]
)Smart timestamp handling#
When you need to provide current time information, consider these strategies:
Option 1: Rounded timestamps
# ✅ Round to larger intervals to increase cache hits
current_hour = datetime.now().replace(minute=0, second=0, microsecond=0)
system_prompt = f"""
You are a helpful assistant.
Current hour: {current_hour.strftime('%Y-%m-%d %H:00')}
...
"""Option 2: Conditional time injection
# ✅ Only add time when the query actually needs it
def build_prompt(user_query, system_base):
prompt = system_base
# Only add timestamp for time-sensitive queries
time_keywords = ['today', 'now', 'current', 'latest', 'recent']
if any(keyword in user_query.lower() for keyword in time_keywords):
prompt += f"\nCurrent time: {datetime.now().isoformat()}"
prompt += f"\nUser: {user_query}"
return promptOption 3: Move time to user message
# ✅ Keep system prompt static, add time context to user message
system_prompt = """
You are a helpful AI assistant...
""" # This stays the same
user_message = f"""
Current time: {datetime.now().isoformat()}
User query: {user_query}
"""
response = client.chat.completions.create(
model="accounts/fireworks/models/<MODEL_ID>",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
]
)By following these optimization techniques, you can significantly improve your application’s performance through effective prompt caching while maintaining the quality and functionality of your AI system.
How it works#
Fireworks will automatically find the longest prefix of the request that is present in the cache and reuse it. The remaining portion of the prompt will be processed as usual.
The entire prompt is stored in the cache for future reuse. Cached prompts usually stay in the cache for at least several minutes. Depending on the model, load level, and deployment configuration, it can be up to several hours. The oldest prompts are evicted from the cache first.
Prompt caching doesn’t alter the result generated by the model. The response you receive will be identical to what you would get if prompt caching was not used. Each generation is sampled from the model independently on each request and is not cached for future usage.
Monitoring#
For dedicated deployments, information about prompt caching is returned in the
response headers. The header fireworks-prompt-tokens contains the number of tokens
in the prompt, out of which fireworks-cached-prompt-tokens are cached.
Aggregated metrics are also available in the usage dashboard.
Data privacy#
Serverless deployments maintain separate caches for each Fireworks account to prevent data leakage and timing attacks.
Dedicated deployments by default share a single cache across all requests.
Because prompt caching doesn’t change the outputs, privacy is preserved even
if the deployment powers a multi-tenant application. It does open a minor risk
of a timing attack: potentially, an adversary can learn that a particular prompt
is cached by observing the response time. To ensure full isolation, you can pass
the x-prompt-cache-isolation-key header or the prompt_cache_isolation_key
field in the body of the request. It can contain an arbitrary string that acts
as an additional cache key, i.e., no sharing will occur between requests with
different IDs.
Advanced: Migration and traffic management#
When migrating between deployments that use prompt caching, it’s crucial to implement proper traffic routing to maintain optimal cache hit rates. When gradually routing traffic to a new deployment, use consistent user/session-based sampling rather than random sampling.
Here’s the recommended implementation for traffic routing:
import hashlib
# Configure traffic fraction (e.g., 20% to new deployment)
fireworks_traffic_fraction = 0.2
user_id = "session-id-123"
# Generate deterministic hash from user_id
hashed_user_id = int(hashlib.md5(user_id.encode()).hexdigest(), 16) # MD5 hash on user-id and convert to integer
MAX_HASH = 2**128 - 1 # MD5 hash maximum value
# Compute ratio for consistent routing
ratio = hashed_user_id / MAX_HASH # Returns 0.0 to 1.0
if (ratio < fireworks_traffic_fraction):
send_to_new_deployment(user=hashed_user_id) # Pass user ID for caching
else:
send_elsewhere() # Route to old deployment or serverlessAvoid random sampling for traffic routing as it can negatively impact cache hit rates:
# Don't do this:
if random() < fireworks_traffic_fraction: # ❌ Reduces cache effectiveness
send_to_new_deployment(user=hashed_user_id)Using consistent user-based routing ensures complete user sessions are maintained on the same deployment, optimizing prompt cache performance regardless of the traffic fraction.