SSD

Simple Self-Distillation (SSD) is described in Embarrassingly Simple Self-Distillation Improves Code Generation.

SSD samples completions from the model at a training-time temperature and truncation configuration, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. It requires no reward model, verifier, teacher model, or reinforcement learning — only a set of problem prompts and the model itself.

In the current TRL implementation:

the model generates completions at a specified training-time temperature (temperature) and truncation (top_k, top_p)
the dataset only requires a prompt column
training uses standard cross-entropy loss on the generated completions
empty or single-line stub completions are filtered by default (filter_empty=True)
the evaluation-time temperature and truncation are set independently at inference time
vLLM can be used for faster generation via use_vllm=True (see vLLM integration)
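As a rough illustration of the default filtering step, the sketch below drops completions that are empty or consist of a single non-blank line. The helper names `is_stub_completion` and `filter_completions`, and the exact rule used, are assumptions made for this example; the actual `filter_empty` logic in TRL may differ.

```python
def is_stub_completion(text: str) -> bool:
    """Return True for completions that are empty or have at most one non-blank line.

    Hypothetical rule for illustration only; not TRL's implementation.
    """
    non_blank = [line for line in text.splitlines() if line.strip()]
    return len(non_blank) <= 1


def filter_completions(prompts, completions):
    """Keep only (prompt, completion) pairs whose completion is not a stub."""
    return [(p, c) for p, c in zip(prompts, completions) if not is_stub_completion(c)]


pairs = filter_completions(
    ["add two numbers", "check prime", "reverse a string"],
    ["def add(a, b):\n    return a + b", "", "pass"],
)
print(pairs)  # only the first pair survives the filter
```

Filtering of this kind is a cheap sanity check on the training data, not a correctness check: the surviving samples are still unverified.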

Usage#

from datasets import Dataset

from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict(
    {
        "prompt": [
            [{"role": "user", "content": "Write a function to add two numbers."}],
            [{"role": "user", "content": "Write a function to check if a number is prime."}],
        ],
    }
)

training_args = SSDConfig(
    output_dir="ssd-model",
    temperature=0.6,           # T_train from the paper
    top_k=20,                  # training-time top-k truncation
    top_p=0.95,                # training-time top-p truncation
    max_completion_length=65536,
    learning_rate=5e-6,
)

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Expected dataset columns#

Each example must provide:

prompt: the problem prompt (string or conversational format)

No privileged_context, reward functions, or teacher model are needed.
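To make the two accepted prompt shapes concrete, the following sketch checks whether an example carries a usable `prompt` in either string or conversational format. The helper `is_valid_prompt` is hypothetical and not part of TRL; it only mirrors the column requirement described above.

```python
def is_valid_prompt(prompt) -> bool:
    """Accept a non-empty string, or a non-empty list of {"role", "content"} messages."""
    if isinstance(prompt, str):
        return bool(prompt.strip())
    if isinstance(prompt, list) and prompt:
        return all(
            isinstance(m, dict)
            and isinstance(m.get("role"), str)
            and isinstance(m.get("content"), str)
            for m in prompt
        )
    return False


examples = [
    # Plain string format.
    {"prompt": "Write a function to add two numbers."},
    # Conversational format.
    {"prompt": [{"role": "user", "content": "Write a function to check if a number is prime."}]},
    # Missing the required prompt column.
    {"completion": "def f(): ..."},
]
valid = [is_valid_prompt(ex.get("prompt")) for ex in examples]
print(valid)  # → [True, True, False]
```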

Key hyperparameters#

The paper identifies the following key hyperparameters:

temperature: training-time sampling temperature (T_train). Higher values create more diverse samples but may include more noise. The paper uses T_train=0.6 with truncation.
top_k and top_p: training-time truncation parameters (rho_train). These suppress low-probability distractor tails during data synthesis.
T_eval: the evaluation-time decoding temperature is set independently at inference time. The paper shows that T_train and T_eval compose through an effective temperature T_eff = T_train * T_eval, with a broad optimal band.
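The effect of these hyperparameters can be sketched on a toy next-token distribution. The function `truncated_distribution` below is a hypothetical, pure-Python illustration; the order of operations (temperature, then top-k, then top-p on the renormalized distribution) is an assumption that follows common practice, and the generation backend used by the trainer may differ in detail. The numeric values are taken from the example commands on this page.

```python
import math


def truncated_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank tokens from most to least likely and keep the top-k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)
    kept = {i: probs[i] / mass for i in order}

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        nucleus, cumulative = {}, 0.0
        for i, p in kept.items():  # dicts preserve the descending order
            nucleus[i] = p
            cumulative += p
            if cumulative >= top_p:
                break
        mass = sum(nucleus.values())
        kept = {i: p / mass for i, p in nucleus.items()}
    return kept


logits = [4.0, 3.0, 2.0, 0.0, -2.0]
dist = truncated_distribution(logits, temperature=1.6, top_k=3, top_p=0.8)
print(sorted(dist))  # low-probability tail tokens are removed

# Effective temperature for the training and evaluation settings shown on this page.
t_train, t_eval = 1.6, 1.1
t_eff = t_train * t_eval
print(round(t_eff, 2))  # → 1.76
```

Raising the temperature flattens the distribution, while truncation removes the tail that the flattening would otherwise expose; the two are tuned together.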

Example script#

Use trl/experimental/ssd/ssd.py to launch SSD training from the command line. The script supports any causal LM from the Hub, custom local datasets via --dataset_path, and PEFT/LoRA via the standard ModelConfig flags.

python trl/experimental/ssd/ssd.py \
    --model_name_or_path Qwen/Qwen3-4B-Instruct-2507 \
    --dataset_name microsoft/rStar-Coder \
    --dataset_config seed_sft \
    --prompt_column question \
    --output_dir outputs/ssd-qwen3-4b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --max_prompt_length 1024 \
    --max_completion_length 65536 \
    --temperature 1.6 \
    --top_k 20 \
    --top_p 0.8 \
    --num_train_epochs 1 \
    --bf16 \
    --report_to trackio

Evaluation on LiveCodeBench#

Use trl/experimental/ssd/ssd_eval.py to evaluate a base model or an SSD-trained checkpoint on LiveCodeBench v6. The script uses vLLM for generation and LiveCodeBench’s official codegen_metrics for sandboxed pass@k scoring; default decoding parameters match Table 3 of the paper.

python trl/experimental/ssd/ssd_eval.py \
    --model_name_or_path <path-or-repo> \
    --temperature 1.1 --top_k 20 --top_p 0.8 \
    --n 1 \
    --output_file outputs/lcb_v6.json
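For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator computed from `n` samples per problem, of which `c` pass the tests. It is an independent illustration; that `codegen_metrics` computes its scores in exactly this way is an assumption, and the function name `pass_at_k` is chosen here for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed (requires k <= n)."""
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With --n 1, pass@1 for a problem is simply 1.0 if the single sample passed, else 0.0.
print(pass_at_k(n=1, c=1, k=1))  # → 1.0
print(pass_at_k(n=1, c=0, k=1))  # → 0.0

# The benchmark score is the mean over problems; (n, c) pairs here are made-up inputs.
results = [(10, 3), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(round(score, 4))
```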

SSDConfig[[trl.experimental.ssd.SSDConfig]]#

trl.experimental.ssd.SSDConfig[[trl.experimental.ssd.SSDConfig]]#

Source

Configuration class for SSDTrainer.

Implements Simple Self-Distillation (SSD) from Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model at a training-time temperature and truncation configuration, then fine-tunes on those raw, unverified samples with standard cross-entropy loss.

The temperature, top_k, and top_p parameters control the training-time sampling configuration (T_train, rho_train in the paper). The evaluation-time configuration (T_eval, rho_eval) is set independently at inference time.

SSDTrainer[[trl.experimental.ssd.SSDTrainer]]#

trl.experimental.ssd.SSDTrainer[[trl.experimental.ssd.SSDTrainer]]#

Source

Trainer for SSD-style on-policy self-distillation with cross-entropy loss.

SSD generates completions from the model at a specified training-time temperature and truncation configuration, then fine-tunes on those raw, unverified samples using standard cross-entropy loss. The dataset only requires a prompt column.

train[[trl.experimental.ssd.SSDTrainer.train]]#

Source

train(resume_from_checkpoint: str | bool | None = None, trial: optuna.Trial | dict[str, Any] | None = None, ignore_keys_for_eval: list[str] | None = None)

Main training entry point.

Parameters:

resume_from_checkpoint (str or bool, optional) : If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (optuna.Trial or dict[str, Any], optional) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (list[str], optional) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.

Returns:

~trainer_utils.TrainOutput

Object containing the global step count, training loss, and metrics.

save_model[[trl.experimental.ssd.SSDTrainer.save_model]]#

Source

Saves the model so that it can be reloaded with from_pretrained().

Only the main process writes the model to disk.

push_to_hub[[trl.experimental.ssd.SSDTrainer.push_to_hub]]#

Source

Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.

Parameters:

commit_message (str, optional, defaults to "End of training") : Message to commit while pushing.

blocking (bool, optional, defaults to True) : Whether the function should return only when the git push has finished.

token (str, optional, defaults to None) : Token with write permission to overwrite Trainer’s original args.

revision (str, optional) : The git revision to commit from. Defaults to the head of the “main” branch.

kwargs (dict[str, Any], optional) : Additional keyword arguments passed along to ~Trainer.create_model_card.

Returns:

The URL of the repository where the model was pushed if blocking=True, or a Future object tracking the progress of the commit if blocking=False.

Link last verified June 7, 2026.

Source: TRL Docs