Price comparison vs Tinker ↗

fireworks guide intermediate agents fine-tuning cost-management

Summary: Estimate the cost of multi-turn agentic RL rollouts on Fireworks compared to Tinker's per-token pricing

Original Documentation

Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.

Estimate the cost of multi-turn agentic RL rollouts on Fireworks compared to Tinker’s per-token pricing

export const MultiTurnCostCalculator = () => { const TINKER = { “kimi-k2p6”: { prefill: 5.15, sample: 12.81, label: “Kimi K2.6 (128K)” }, “kimi-k2p5”: { prefill: 5.15, sample: 12.81, label: “Kimi K2.5 (128K)” }, “qwen3p5-397b”: { prefill: 4.0, sample: 10.0, label: “Qwen3.5-397B-A17B (256K)” }, “gpt-oss-120b”: { prefill: 0.63, sample: 1.54, label: “GPT-OSS-120B (128K)” } }; const FW_GPU = { H100_80GB: 7.0, H200_141GB: 7.0, B200_180GB: 10.0, B300_288GB: 12.0 }; const DEDICATED_DEFAULTS = { “kimi-k2p6”: { gpus: 8, gpuType: “B300_288GB”, promptTokensPerSec: 30000, decodeTokensPerSec: 1500 }, “kimi-k2p5”: { gpus: 8, gpuType: “B300_288GB”, promptTokensPerSec: 30000, decodeTokensPerSec: 1500 }, “qwen3p5-397b”: { gpus: 8, gpuType: “B200_180GB”, promptTokensPerSec: 20000, decodeTokensPerSec: 1200 }, “gpt-oss-120b”: { gpus: 4, gpuType: “H200_141GB”, promptTokensPerSec: 25000, decodeTokensPerSec: 2500 } }; const [modelKey, setModelKey] = useState(“kimi-k2p6”); const [episodes, setEpisodes] = useState(1000); const [turnsPerEpisode, setTurnsPerEpisode] = useState(8); const [initContext, setInitContext] = useState(8000); const [tokensAddedPerTurn, setTokensAddedPerTurn] = useState(12000); const [decodePerTurn, setDecodePerTurn] = useState(2000); const dedDefaults = DEDICATED_DEFAULTS[modelKey]; const [gpus, setGpus] = useState(dedDefaults.gpus); const [promptTps, setPromptTps] = useState(dedDefaults.promptTokensPerSec); const [decodeTps, setDecodeTps] = useState(dedDefaults.decodeTokensPerSec); const onModelChange = k => { setModelKey(k); const d = DEDICATED_DEFAULTS[k]; setGpus(d.gpus); setPromptTps(d.promptTokensPerSec); setDecodeTps(d.decodeTokensPerSec); }; const T = turnsPerEpisode; const sumPromptTokens = T * initContext + tokensAddedPerTurn * (T * (T - 1)) / 2; const finalContext = initContext + (T - 1) * tokensAddedPerTurn; const decodeTokens = T * decodePerTurn; const tinkerPromptTokens = sumPromptTokens; const tinkerDecodeTokens = decodeTokens; const fwUncachedPromptTokens = finalContext; const fwCachedPromptTokens = Math.max(0, sumPromptTokens - finalContext); const cacheHitPct = sumPromptTokens > 0 ? fwCachedPromptTokens / sumPromptTokens * 100 : 0; const fmt = n => $${n.toFixed(n < 10 ? 2 : 0)}; const M = 1_000_000; const tinker = TINKER[modelKey]; const tinkerPerEpisode = tinkerPromptTokens / M * tinker.prefill + tinkerDecodeTokens / M * tinker.sample; const tinkerTotal = tinkerPerEpisode * episodes; const totalUncachedPromptTokens = fwUncachedPromptTokens * episodes; const totalDecodeTokens = decodeTokens * episodes; const prefillTimeSec = totalUncachedPromptTokens / promptTps; const decodeTimeSec = totalDecodeTokens / decodeTps; const dedicatedSec = Math.max(prefillTimeSec, decodeTimeSec); const dedicatedHours = dedicatedSec / 3600; const gpuRate = FW_GPU[dedDefaults.gpuType] || 10.0; const dedicatedTotal = dedicatedHours * gpus * gpuRate; const compareLabel = (tinker, fw) => { if (!(fw > 0) || !(tinker > 0)) return “—”; const r = tinker / fw; if (r >= 1.05) return r.toFixed(1) + “× cheaper”; if (r <= 0.95) return (1 / r).toFixed(1) + “× more”; return “at par”; }; const tinkerSec = tinkerTotal * 3600 / (gpus * gpuRate); const promptTpsParity = totalUncachedPromptTokens / Math.max(tinkerSec, 1); const decodeTpsParity = totalDecodeTokens / Math.max(tinkerSec, 1); const baseSec = Math.max(totalUncachedPromptTokens / promptTps, totalDecodeTokens / decodeTps); const gpusParityMax = Math.max(1, Math.floor(tinkerTotal * 3600 / (baseSec * gpuRate))); const labelCls = “text-sm font-medium text-zinc-700 dark:text-zinc-300”; const inputCls = “w-full mt-2 rounded-lg border border-zinc-300 dark:border-zinc-600 bg-white dark:bg-zinc-800 px-3 py-2 text-sm text-zinc-900 dark:text-zinc-100”; return

  <label>Model</label>
  <select value={modelKey} onChange={e => onModelChange(e.target.value)}>
    {Object.entries(TINKER).map(([k, v]) => <option key={k} value={k}>{v.label}</option>)}
  </select>



  <label>Episodes</label>
  <input type="number" min={1} max={10_000_000} value={episodes} onChange={e => setEpisodes(Math.max(1, Number(e.target.value) || 1))} />


  <label>Turns per episode</label>
  <input type="number" min={1} max={200} value={turnsPerEpisode} onChange={e => setTurnsPerEpisode(Math.max(1, Number(e.target.value) || 1))} />


  <label>Initial prompt (tokens)</label>
  <input type="number" min={0} value={initContext} onChange={e => setInitContext(Math.max(0, Number(e.target.value) || 0))} />


  <label>Tokens added per turn (gen + tool result)</label>
  <input type="number" min={0} value={tokensAddedPerTurn} onChange={e => setTokensAddedPerTurn(Math.max(0, Number(e.target.value) || 0))} />


  <label>Generation tokens per turn</label>
  <input type="number" min={0} value={decodePerTurn} onChange={e => setDecodePerTurn(Math.max(0, Number(e.target.value) || 0))} />

Advanced — dedicated cluster sizing & throughput (defaults are saturated estimates)

    <label>
      Dedicated GPUs ({dedDefaults.gpuType})
      <span> · max {gpusParityMax} (at-par with Tinker)</span>
    </label>
    <input type="number" min={1} max={gpusParityMax} value={gpus} onChange={e => {

const v = Number(e.target.value) || 1; setGpus(Math.min(gpusParityMax, Math.max(1, v))); }} />

    <label>
      Prompt tokens/sec (cluster)
      <span> · min {Math.ceil(promptTpsParity).toLocaleString()} (at-par)</span>
    </label>
    <input type="number" min={Math.ceil(promptTpsParity)} value={promptTps} onChange={e => {

const v = Number(e.target.value) || promptTpsParity; setPromptTps(Math.max(Math.ceil(promptTpsParity), v)); }} />

    <label>
      Decode tokens/sec (cluster)
      <span> · min {Math.ceil(decodeTpsParity).toLocaleString()} (at-par)</span>
    </label>
    <input type="number" min={Math.ceil(decodeTpsParity)} value={decodeTps} onChange={e => {

const v = Number(e.target.value) || decodeTpsParity; setDecodeTps(Math.max(Math.ceil(decodeTpsParity), v)); }} />

{}

  <strong>Per-episode token math:</strong>
  {sumPromptTokens.toLocaleString()} prompt-tokens-summed (Tinker bills all of these as prefill);
  {finalContext.toLocaleString()} unique tokens prefilled on Fireworks (rest hit the prefix cache);
  {decodeTokens.toLocaleString()} decode tokens.
  <strong>Cache hit:</strong> {cacheHitPct.toFixed(1)}%

{}

{[{

key: “tinker”, title: “Tinker”, subtitle: “per-token, no cross-turn cache”, perEp: tinkerPerEpisode, total: tinkerTotal, vs: “1×”, highlight: false }, { key: “fw-ded”, title: “Fireworks Dedicated”, subtitle: ${gpus}× ${dedDefaults.gpuType.replace("_", " ")} · $${gpuRate}/GPU-hr · ≈ ${dedicatedHours.toFixed(2)} cluster-hours at chosen throughput, perEp: dedicatedTotal / episodes, total: dedicatedTotal, vs: compareLabel(tinkerTotal, dedicatedTotal), highlight: true }].map(row =>

      {row.title}
    
    
      {row.subtitle}
    
    
      
        $ / episode
        
          {fmt(row.perEp)}
        
      
      
        
          $ / {episodes.toLocaleString()} ep
        
        
          {fmt(row.total)}
        
      
      
        vs. Tinker
        
          {row.vs}
        
      
    
  )}

Order-of-magnitude estimates. Pricing constants reflect rates last verified 2026-05-16: Tinker rates from thinkingmachines.ai/tinker, Fireworks GPU-hour from fireworks.ai/pricing. Dedicated throughput defaults are saturated estimates; tune them in the Advanced panel for your workload.

; };

If you’re running RL or agentic post-training on a long-context model and your provider bills you per token with no cross-turn prefix cache, the prefill cost grows quadratically with the number of turns — every turn re-prefills the full conversation history. On Fireworks Dedicated, session-affinity routing keeps an episode pinned to one replica so the KV cache is reused across turns, and cached prompt tokens contribute essentially zero extra compute.

The calculator below makes that difference concrete. Set your episode shape (turns, context growth, generation length) and compare:

Tinker — flat per-token billing, no cross-turn cache (re-prefill every turn)
Fireworks Dedicated — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate

Performance and benchmarking notes#

Dedicated trainer vs pooled/serverless resourcing#

Tinker runs training jobs on a pooled/serverless GPU fleet, which lets a single job burst onto many more GPUs than you would dedicate to a replica on Fireworks. That burst is what makes individual Tinker steps feel fast — but it also caps the maximum training speed you can buy: you cannot pay to scale beyond the pool’s per-job allocation, and you cannot reserve isolated capacity.

Fireworks dedicated trainers take the opposite trade-off: predictable, isolated execution with no shared-pool queueing or noisy-neighbor variance, and the ability to scale wall-clock time and cost independently by adjusting replica count. If you want faster steps on dedicated, increase replica count and parallelize work.

For large model training or longer rollouts, we have consistently found the dedicated setup like ours is cheaper overall and can also be faster depending on the customer’s resourcing needs.

Context-length benchmarking caveat#

Benchmark comparisons are only apples-to-apples when truncation policy and effective context length are matched. If one system truncates >32k samples and another does not, the non-truncating run is doing more work and will appear slower.

Replica count is a speed/cost knob#

Users can trade cost and wall-clock time by scaling replicas. A quick back-of-envelope estimate:

$$ \text{$ / 1M tokens} \approx \frac{\text{GPU count} \cdot \text{$ / GPU-hour}}{\text{tokens/sec(cluster)} \cdot 3600} \cdot 10^6 $$

How the numbers come together#

Tinker (the cost customers describe)#

Each turn re-prefills the full accumulated context:

$$ \text{Prefill tokens (Tinker)} = \sum_{t=1}^{T} P_t = T \cdot P_1 + \Delta \cdot \frac{T(T-1)}{2} $$

…where $P_1$ is the initial prompt (system + tools + task), $\Delta$ is the context added per turn (model response + tool result), and $T$ is the turn count. This is quadratic in $T$.

$$ \text{Cost (Tinker)} = \frac{\text{Prefill tokens}}{10^6} \cdot r_{\text{prefill}} + \frac{\text{Decode tokens}}{10^6} \cdot r_{\text{sample}} $$

Fireworks Dedicated — GPU-hour billing#

Dedicated deployments are billed per GPU-second, so the prefix cache shows up as higher effective throughput rather than a discount on per-token rates. Across one episode, each unique token is prefilled at most once — the rest of the prompt is served from the prefix cache and contributes essentially no GPU work. The uncached portion that actually hits prefill is:

$$ \text{Uncached prompt} = P_T = P_1 + (T - 1) \Delta $$

On a saturated cluster:

$$ \text{Cluster-hours} = \frac{\text{Uncached prompt} / \text{prefill TPS}}{3600} $$

$$ \text{Cost} = \text{Cluster-hours} \cdot N_{\text{GPU}} \cdot r_{\text{GPU/hr}} $$

Because cached tokens contribute essentially nothing to wall-clock work, the cluster’s effective $/M token rate falls as utilization rises. For continuous RL training, where rollouts run at sustained pace, dedicated is typically the cheapest path at scale.

The calculator’s dedicated path uses saturated throughput estimates as defaults. A small, lightly-loaded test deployment will look more expensive per token than these numbers because the cluster is paid for whether it’s busy or idle. Tune the throughput inputs in the Advanced panel to match your actual rollout pace.

What’s covered#

The calculator currently includes the four models for which Tinker publishes per-token rates:

Model	Tinker prefill / sample (per 1M)
Kimi K2.6 (128K)	$5.15 / $12.81
Kimi K2.5 (128K)	$5.15 / $12.81
Qwen3.5-397B-A17B (256K)	$4.00 / $10.00
GPT-OSS-120B (128K)	$0.63 / $1.54

All Fireworks-side rates are taken from the public pages linked below and the constants live in snippets/multi-turn-cost-calculator.jsx — update there if either side’s pricing changes.

FAQ#

What is the fastest way to reduce wall-clock time?#

Increase replicas and overlap sampling/training where your workflow allows it. Those are usually the most direct levers for shortening end-to-end cycle time.

How should I compare costs between providers?#

Use matched assumptions for context length, truncation policy, and effective resource allocation. The calculator at the top of this page handles the math once you plug in your episode shape — be sure to also align truncation policy and effective context window between providers before drawing conclusions.

Sources#

Tinker pricing: thinkingmachines.ai/tinker
Fireworks GPU-hour pricing: fireworks.ai/pricing
Related: RFT Cost Estimator — same idea, but for the training-side bill (Fireworks GPU-hour, no comparison column).

This is an estimator, not a quote (updated). Real costs depend on your exact workload, cache hit rate, hardware utilization, and rate-card terms at run time.

Link last verified June 7, 2026. View original ↗

Source: Fireworks AI Docs

Link last verified: 2026-06-07