Cleanup and Teardown
Delete trainer jobs and deployments after experiments to avoid leaked resources.
What this is#
RLOR trainer jobs and weight-sync-enabled deployments hold GPU resources. Always clean up after experiments — especially if jobs terminate unexpectedly. In new SDK and cookbook code, cleanup is owned by the SDK-managed service client.
Automatic cleanup via the SDK-managed service#
Create the service with cleanup options, then close it in finally:
from fireworks.training.sdk import FiretitanServiceClient

service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model="accounts/fireworks/models/qwen3-8b",
    tokenizer_model="Qwen/Qwen3-8B",
    lora_rank=0,
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    deployment_id="research-serving",
    cleanup_trainer_on_close=True,
    cleanup_deployment_on_close="scale_to_zero",
)
try:
    run_training_loop()
finally:
    service.close()

cleanup_trainer_on_close=True deletes SDK-managed trainers. Separate reference trainers are governed by cleanup_reference_trainer_on_close (default True). cleanup_deployment_on_close="scale_to_zero" releases deployment GPUs while keeping the deployment resource around; use "delete" only when you want to remove the deployment entirely.
Cookbook recipes use the same service-client lifecycle internally and close the service through an ExitStack.
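The ExitStack pattern mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the cookbook's actual code: FakeService, run_with_cleanup, and failing_loop are hypothetical stand-ins, and the only behaviour assumed of the real client is that it exposes close().

```python
from contextlib import ExitStack


class FakeService:
    """Hypothetical stand-in for an SDK-managed service client."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real client would release trainers and deployments here.
        self.closed = True


def run_with_cleanup(service, training_loop) -> None:
    """Run training_loop and close the service however the loop exits."""
    with ExitStack() as stack:
        # Registered callbacks run when the stack unwinds, whether the
        # body returns normally, raises, or is interrupted.
        stack.callback(service.close)
        training_loop()


def failing_loop() -> None:
    raise RuntimeError("simulated training failure")


service = FakeService()
try:
    run_with_cleanup(service, failing_loop)
except RuntimeError:
    pass  # the failure still propagates; cleanup has already run

print("service closed:", service.closed)
```

Registering close as a callback keeps the teardown next to the setup, which helps when several resources are opened in sequence and each needs its own cleanup.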
The standalone ResourceCleanup context manager and setup_infra helper have been removed from the cookbook. Provisioning and teardown now live behind the SDK-managed service client. See Migrating from the deprecated managed infra.
Trainer inactivity cleanup#
Long-running RLOR trainer jobs are automatically stopped after 60 minutes with no tracked activity. The trainer reports this activity to the control plane, and tracked activity includes trainer API operations and active-session heartbeats.
When creating a trainer through the REST API (POST /v1/accounts/{account_id}/rlorTrainerJobs), set inactivityTimeout to a positive protobuf JSON duration to choose a different timeout:
{
  "inactivityTimeout": "1800s"
}

When creating a trainer through the legacy manager API, set TrainerJobConfig.inactivity_timeout and pass the config to TrainerJobManager.create(...) or TrainerJobManager.create_and_wait(...):
from datetime import timedelta
from fireworks.training.sdk import TrainerJobConfig

config = TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/<shape>/versions/<version>",
    inactivity_timeout=timedelta(minutes=30),
)

With firectl, use --inactivity-timeout 30m or --inactivity-timeout 2h. When the value is omitted or set to 0, Fireworks uses the 60-minute default.
To disable automatic inactivity cleanup, set disableInactivityCleanup in the REST API, set TrainerJobConfig.disable_inactivity_cleanup=True in the Training SDK, or pass --disable-inactivity-cleanup in firectl. The trainer will not be stopped due to inactivity, and GPU usage continues to accrue while the trainer is running, so delete the trainer when you no longer need it.
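For REST callers, the inactivityTimeout value is a protobuf JSON duration: a number of seconds followed by "s". The sketch below converts a timedelta into that string and builds only the inactivity part of a request body. The helper names to_proto_duration and inactivity_fields are hypothetical, and the other fields of the create request are deliberately left out.

```python
import json
from datetime import timedelta
from typing import Optional


def to_proto_duration(value: timedelta) -> str:
    """Format a positive timedelta as a protobuf JSON duration, e.g. "1800s"."""
    if value <= timedelta(0):
        raise ValueError("inactivity timeout must be positive")
    # Work in integer microseconds to avoid float rounding.
    total_us = value // timedelta(microseconds=1)
    seconds, micros = divmod(total_us, 1_000_000)
    if micros == 0:
        return f"{seconds}s"
    # Keep the fractional part but trim trailing zeros.
    return f"{seconds}.{micros:06d}".rstrip("0") + "s"


def inactivity_fields(timeout: Optional[timedelta]) -> dict:
    """Return only the inactivity field for a trainer create request body."""
    if timeout is None:
        # Field omitted: the 60-minute default applies.
        return {}
    return {"inactivityTimeout": to_proto_duration(timeout)}


print(json.dumps(inactivity_fields(timedelta(minutes=30))))
```

Rejecting zero and negative values in the helper surfaces a mistake locally rather than silently falling back to the default timeout.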
Manual compatibility cleanup#
If you provisioned resources yourself with TrainerJobManager / DeploymentManager instead of the managed service, delete them directly.
Cleaning up RLOR trainer jobs#
import os
from fireworks.training.sdk import TrainerJobManager, DeploymentManager
api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)
# Delete known trainer jobs from this run
for job_id in ["<policy-job-id>", "<reference-job-id>"]:
    rlor_mgr.delete(job_id=job_id)
Cleaning up deployments#
deploy_mgr.delete(deployment_id="<deployment-id>")

If you want to keep the deployment resource but release GPUs (a lighter alternative to delete):
deploy_mgr.scale_to_zero(deployment_id="<deployment-id>")

This sets both minReplicaCount and maxReplicaCount to 0, releasing all accelerators while keeping the deployment available for future scale-up.
Manual cleanup with try/finally#
policy_job_id = "<policy-job-id>"
reference_job_id = "<reference-job-id>"
deployment_id = "research-loop-serving"
try:
    run_training_loop()
finally:
    rlor_mgr.delete(policy_job_id)
    rlor_mgr.delete(reference_job_id)
    deploy_mgr.delete(deployment_id)
Checking for leaked resources#
Track the IDs you create (trainer job IDs + deployment ID) and clean those explicitly. For broad account-wide discovery, use the Fireworks console or the managed fw.*.list() APIs.
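One way to track created IDs and clean each of them explicitly is a small registry that records a delete action per resource and keeps going when one deletion fails, so a single API error does not leak the remaining resources. This is a sketch under stated assumptions: CleanupRegistry and fake_delete are hypothetical and not part of the SDK, and the delete functions here are stand-ins that only record what they were asked to remove.

```python
from typing import Callable, List, Tuple


class CleanupRegistry:
    """Hypothetical helper: remembers created resources and deletes each one."""

    def __init__(self) -> None:
        self._steps: List[Tuple[str, Callable[[], None]]] = []

    def track(self, label: str, delete: Callable[[], None]) -> None:
        """Record a resource label together with the call that deletes it."""
        self._steps.append((label, delete))

    def cleanup(self) -> List[Tuple[str, Exception]]:
        """Run every delete in registration order and return the failures."""
        failures: List[Tuple[str, Exception]] = []
        steps, self._steps = self._steps, []
        for label, delete in steps:
            try:
                delete()
            except Exception as exc:
                # Keep going so one failed delete does not leak the rest.
                failures.append((label, exc))
        return failures


deleted: List[str] = []


def fake_delete(resource_id: str) -> Callable[[], None]:
    """Stand-in for a manager delete call; fails for one ID on purpose."""
    def _delete() -> None:
        if resource_id == "policy-job":
            raise RuntimeError("simulated API error")
        deleted.append(resource_id)
    return _delete


registry = CleanupRegistry()
for rid in ["policy-job", "reference-job", "serving-deployment"]:
    registry.track(rid, fake_delete(rid))

failures = registry.cleanup()
print("deleted:", deleted)
print("failed:", [label for label, _ in failures])
```

With the managers shown earlier, the tracked action could be something like functools.partial(rlor_mgr.delete, job_id=job_id). Any label that comes back in the failure list is a candidate for a manual check in the console.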
Operational guidance#
- Delete both policy and reference trainers when running GRPO (which uses 2 RLOR jobs).
- Close the managed service in finally so trainer/reference/deployment cleanup runs on Ctrl+C or exceptions.
- Don’t delete a trainer while a save_weights_for_sampler operation is in progress; wait for it to complete first.
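The last point, waiting for an in-flight save before deleting, can be enforced in client code with a small guard. The sketch below is one possible approach, not an SDK feature: InFlightGuard and fake_save are hypothetical, and the sleep stands in for a slow save_weights_for_sampler call.

```python
import threading
import time
from typing import List, Optional


class InFlightGuard:
    """Hypothetical helper that lets teardown wait for in-flight operations."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self._idle = threading.Event()
        self._idle.set()  # nothing is running yet

    def __enter__(self) -> "InFlightGuard":
        with self._lock:
            self._active += 1
            self._idle.clear()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._idle.set()
        return False  # never swallow exceptions from the guarded call

    def wait_idle(self, timeout: Optional[float] = None) -> bool:
        """Block until no guarded operation is running; False on timeout."""
        return self._idle.wait(timeout)


events: List[str] = []
guard = InFlightGuard()
started = threading.Event()


def fake_save() -> None:
    # Stand-in for a slow weight save running in the background.
    with guard:
        started.set()
        time.sleep(0.05)
        events.append("save finished")


worker = threading.Thread(target=fake_save)
worker.start()
started.wait()  # make sure the save is in flight before teardown begins

try:
    pass  # the training loop would run here
finally:
    # Only delete once the in-flight save has completed.
    if guard.wait_idle(timeout=5.0):
        events.append("trainer deleted")

worker.join()
print(events)
```

If wait_idle times out, the caller still has to decide whether to keep waiting or delete anyway; the sketch leaves that policy open.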