SDFT

Self-Distilled Fine-Tuning (SDFT) is described in Self-Training with On-Policy Self-Distillation for Language Model Alignment.

The TRL implementation adapts SDFT to the experimental trainer API while reusing the shared self-distillation infrastructure also used by SDPO.

In the current TRL implementation:

the teacher is the model itself: for PEFT models, the base weights with the adapter disabled; for non-PEFT models, the same model run under no_grad. Use sync_ref_model=True for an EMA teacher
the dataset must provide both prompt and privileged_context
privileged_context contains only the extra teacher-only information; the trainer combines it with prompt to build the teacher prompt
teacher_prompt_template controls how prompt and privileged_context are combined into the teacher prompt
on-policy generation can use either the student prompt or the teacher-conditioned prompt via generate_from_teacher
num_loss_tokens_to_skip can exclude initial completion tokens from the distillation loss
SDFT currently supports text-only training and does not support use_vllm=True
the shared dataset contract is prompt plus privileged_context
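The way prompt and privileged_context are combined can be illustrated with the default template documented for teacher_prompt_template. This is a minimal sketch, assuming the template is filled with Python str.format-style substitution on a plain-text prompt; the helper build_teacher_prompt is hypothetical and is not part of TRL, and the trainer's handling of conversational prompts may differ.

```python
# Hypothetical helper, not part of TRL: shows how a template with
# {prompt} and {privileged_context} placeholders could be filled in.
DEFAULT_TEMPLATE = "{prompt}\n\n{privileged_context}"  # default documented for SDFTConfig


def build_teacher_prompt(prompt: str, privileged_context: str, template: str = DEFAULT_TEMPLATE) -> str:
    # Assumption: str.format-style substitution on a plain-text prompt.
    return template.format(prompt=prompt, privileged_context=privileged_context)


student_prompt = "Solve 2+2."
teacher_prompt = build_teacher_prompt(student_prompt, "Example answer: 4.")
print(teacher_prompt)
```

The student only ever sees student_prompt; the extra line with the example answer exists only in the teacher prompt.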

Usage#

from datasets import Dataset

from trl.experimental.sdft import SDFTConfig, SDFTTrainer

dataset = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
        "privileged_context": ["Example answer: 4."],
    }
)

training_args = SDFTConfig(
    output_dir="sdft-model",
    distillation_alpha=0.5,
    distillation_topk=5,
    max_completion_length=64,
)

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

To generate from the teacher-conditioned prompt instead of the student prompt, set generate_from_teacher=True. To customize how the teacher prompt is built, set teacher_prompt_template on SDFTConfig.

Expected dataset columns#

Each example must provide:

prompt: the student-facing prompt
privileged_context: only the extra teacher-only information, such as a demonstration, hint, or privileged feedback

Both standard text prompts and conversational prompts are supported by the trainer prompt handling.
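Before handing a dataset to the trainer, it can help to check that every example satisfies the column contract above. The following is a minimal sketch in plain Python; check_sdft_example is a hypothetical helper (not a TRL function), and the assumption that a conversational prompt is a non-empty list of role/content messages is based on the usage example rather than on the trainer's own validation.

```python
from typing import Any


def check_sdft_example(example: dict[str, Any]) -> str:
    """Return "text" or "conversational" for a well-formed example, else raise ValueError."""
    for column in ("prompt", "privileged_context"):
        if column not in example:
            raise ValueError(f"missing required column: {column}")
    prompt = example["prompt"]
    if isinstance(prompt, str):
        return "text"
    # Assumption: conversational prompts are lists of {"role": ..., "content": ...} messages.
    if isinstance(prompt, list) and prompt and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in prompt
    ):
        return "conversational"
    raise ValueError("prompt must be a string or a non-empty list of role/content messages")


text_example = {"prompt": "Solve 2+2.", "privileged_context": "Example answer: 4."}
chat_example = {
    "prompt": [{"role": "user", "content": "Solve 2+2."}],
    "privileged_context": "Example answer: 4.",
}
print(check_sdft_example(text_example), check_sdft_example(chat_example))
```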

Callbacks#

The trainer emits a small set of callback hooks that are useful for debugging, observability, and tests. These hooks are intended as practical integration points for experimental self-distillation workflows.

Shared self-distillation hooks:

on_self_distillation_batch_prepared: fired when a self-distillation batch is ready. The payload includes prompt_ids, completion_ids, and old_per_token_logps when importance-sampling clipping inputs are available.
on_generation_batch_built: fired when a new buffered generation batch is created. The payload includes generate_every and steps_per_generation.

SDFT-specific hook:

on_generation_prompts_selected: fired when SDFT chooses the prompt source for on-policy generation. The payload includes the selected generation_prompts and the corresponding generation_prompt_text.
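A small recorder can make the hooks above easier to inspect in tests. This sketch assumes the hooks are delivered like standard Trainer callback events, that is, as methods named after the hook that receive (args, state, control) plus the payload as keyword arguments; that dispatch convention is an assumption, not something this page confirms. HookRecorder is a hypothetical class, written here without any base class so it runs on its own; in a real run it would likely need to subclass transformers.TrainerCallback and be passed to the trainer.

```python
class HookRecorder:
    """Records which payload keys each self-distillation hook receives."""

    def __init__(self):
        self.events = []  # list of (hook_name, sorted payload keys)

    def _record(self, name, payload):
        self.events.append((name, sorted(payload)))

    # Assumption: each hook is called as method(args, state, control, **payload).
    def on_self_distillation_batch_prepared(self, args, state, control, **kwargs):
        self._record("on_self_distillation_batch_prepared", kwargs)

    def on_generation_batch_built(self, args, state, control, **kwargs):
        self._record("on_generation_batch_built", kwargs)

    def on_generation_prompts_selected(self, args, state, control, **kwargs):
        self._record("on_generation_prompts_selected", kwargs)


# Simulated dispatch for illustration only; during training the trainer fires the hook.
recorder = HookRecorder()
recorder.on_generation_batch_built(None, None, None, generate_every=4, steps_per_generation=4)
print(recorder.events)
```

Recording only the payload keys keeps the log small, since the real payloads may contain large tensors.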

Example script#

Use trl/experimental/sdft/sdft.py to launch SDFT training from the command line. The script supports any causal LM from the Hub, custom local datasets via --dataset_path, and PEFT/LoRA via the standard ModelConfig flags.

python trl/experimental/sdft/sdft.py \
    --model_name_or_path Qwen/Qwen3.5-0.8B \
    --dataset_name your-org/your-dataset \
    --output_dir outputs/sdft-qwen3.5-0.8b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --max_prompt_length 1024 \
    --max_completion_length 512 \
    --generate_from_teacher \
    --sync_ref_model \
    --ref_model_sync_steps 1 \
    --ref_model_mixup_alpha 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --report_to wandb
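The --sync_ref_model, --ref_model_sync_steps, and --ref_model_mixup_alpha flags in the command above configure the EMA teacher. The sketch below illustrates the mixing rule on plain Python lists, assuming SDFT follows the convention documented for reference-model syncing in other TRL trainers (ref = alpha * policy + (1 - alpha) * previous ref). The function mix_reference is hypothetical and operates on toy numbers, not model parameters.

```python
def mix_reference(ref_weights, policy_weights, alpha):
    # Assumed rule: ref_new = alpha * policy + (1 - alpha) * ref_old, per parameter.
    return [alpha * p + (1.0 - alpha) * r for r, p in zip(ref_weights, policy_weights)]


ref = [0.0, 1.0]     # toy teacher (reference) weights
policy = [1.0, 1.0]  # toy student (policy) weights, held fixed here

# With --ref_model_sync_steps 1 the teacher would be mixed after every step.
for _ in range(3):
    ref = mix_reference(ref, policy, alpha=0.01)

print(ref)
```

With alpha=0.01 the teacher moves only 1% of the way toward the student per sync, so it trails the student smoothly instead of copying it.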

SDFTConfig[[trl.experimental.sdft.SDFTConfig]]#


Configuration class for SDFTTrainer.

This adapts the official SDFT implementation to the TRL trainer API while reusing the common self-distillation configuration shared with SDPO.

Parameters:

disable_dropout (bool, optional, defaults to True) : Whether to disable dropout in the student and teacher models.

generate_from_teacher (bool, optional, defaults to False) : Whether on-policy generation should use the teacher-conditioned prompt instead of the student prompt.

teacher_prompt_template (str, optional, defaults to "{prompt}\n\n{privileged_context}") : Template used to combine the student prompt and privileged context into the teacher prompt.

num_loss_tokens_to_skip (int, optional, defaults to 0) : Number of initial completion tokens to exclude from the distillation loss.
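The effect of num_loss_tokens_to_skip can be pictured as a change to the per-token loss mask. This is a sketch under the assumption that skipping is equivalent to zeroing the first positions of each completion's mask; the helper skip_initial_tokens is hypothetical and the trainer's internal implementation is not shown on this page.

```python
def skip_initial_tokens(completion_mask, num_loss_tokens_to_skip):
    """Zero the first `num_loss_tokens_to_skip` positions of each row's mask."""
    return [
        [0 if i < num_loss_tokens_to_skip else m for i, m in enumerate(row)]
        for row in completion_mask
    ]


# 1 marks a completion token that contributes to the loss, 0 marks padding.
mask = [[1, 1, 1, 1], [1, 1, 1, 0]]
masked = skip_initial_tokens(mask, 2)
print(masked)
```

With the default of 0, the mask is returned unchanged and every completion token contributes to the distillation loss.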

SDFTTrainer[[trl.experimental.sdft.SDFTTrainer]]#


Trainer for SDFT-style on-policy self-distillation with explicit teacher prompts.

train[[trl.experimental.sdft.SDFTTrainer.train]]#

train(resume_from_checkpoint: str | bool | None = None, trial: optuna.Trial | dict[str, Any] | None = None, ignore_keys_for_eval: list[str] | None = None)

Main training entry point.

Parameters:

resume_from_checkpoint (str or bool, optional) : If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (optuna.Trial or dict[str, Any], optional) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (list[str], optional) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.

Returns:

trainer_utils.TrainOutput

Object containing the global step count, training loss, and metrics.

save_model[[trl.experimental.sdft.SDFTTrainer.save_model]]#


Saves the model so that it can be reloaded with from_pretrained().

Only the main process saves.

push_to_hub[[trl.experimental.sdft.SDFTTrainer.push_to_hub]]#


Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.

Parameters:

commit_message (str, optional, defaults to "End of training") : Message to commit while pushing.

blocking (bool, optional, defaults to True) : Whether the function should return only when the git push has finished.

token (str, optional, defaults to None) : Token with write permission to overwrite Trainer’s original args.

revision (str, optional) : The git revision to commit from. Defaults to the head of the “main” branch.

kwargs (dict[str, Any], optional) : Additional keyword arguments passed along to Trainer.create_model_card.

Returns:

The URL of the repository where the model was pushed if blocking=True, or a Future object tracking the progress of the commit if blocking=False.

Link last verified June 7, 2026.

Source: TRL Docs
