SDFT
Self-Distilled Fine-Tuning (SDFT) is described in Self-Training with On-Policy Self-Distillation for Language Model Alignment.
The TRL implementation adapts SDFT to the experimental trainer API while reusing the shared self-distillation infrastructure also used by SDPO.
In the current TRL implementation:
- the teacher is the model itself (base weights with adapter disabled for PEFT, or the same model under `no_grad` for non-PEFT); use `sync_ref_model=True` for an EMA teacher
- the dataset must provide both `prompt` and `privileged_context`
- `privileged_context` contains only the extra teacher-only information; the trainer combines it with `prompt` to build the teacher prompt
- `teacher_prompt_template` controls how `prompt` and `privileged_context` are combined into the teacher prompt
- on-policy generation can use either the student prompt or the teacher-conditioned prompt via `generate_from_teacher`
- `num_loss_tokens_to_skip` can exclude initial completion tokens from the distillation loss
- SDFT currently supports text-only training and does not support `use_vllm=True`
- the shared dataset contract is `prompt` plus `privileged_context`
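The EMA teacher option mentioned in the list above can be pictured with a small sketch. It assumes the usual exponential-moving-average rule, in which each teacher weight moves a fraction `alpha` of the way toward the student weight, and it assumes `ref_model_mixup_alpha` plays the role of `alpha`. The function `ema_update` is a hypothetical helper that works on plain floats, not the trainer's own code.

```python
def ema_update(teacher, student, alpha):
    """Return new teacher weights mixed toward the student (assumed EMA rule).

    new_teacher = (1 - alpha) * teacher + alpha * student
    Weights are plain floats here; a real model would hold tensors.
    """
    return {
        name: (1.0 - alpha) * teacher[name] + alpha * student[name]
        for name in teacher
    }


# Toy weights for illustration only.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

# With a small alpha the teacher drifts slowly toward the student.
teacher = ema_update(teacher, student, alpha=0.01)
```

With a small `alpha` the teacher changes slowly, so it lags behind the student rather than copying it at every sync step.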
Usage#
```python
from datasets import Dataset

from trl.experimental.sdft import SDFTConfig, SDFTTrainer

dataset = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
        "privileged_context": ["Example answer: 4."],
    }
)

training_args = SDFTConfig(
    output_dir="sdft-model",
    distillation_alpha=0.5,
    distillation_topk=5,
    max_completion_length=64,
)

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

To generate from the teacher-conditioned prompt instead of the student prompt, set generate_from_teacher=True.
To customize how the teacher prompt is built, set teacher_prompt_template on SDFTConfig.
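As a rough illustration of the two options above, the sketch below fills the documented default template for a plain-text prompt and then picks the prompt used for generation. It assumes the template is filled with `str.format`, which may not match what the trainer does, and it does not cover conversational prompts. `build_teacher_prompt` and `select_generation_prompt` are hypothetical helpers, not TRL functions.

```python
# Default taken from the SDFTConfig parameter documentation.
DEFAULT_TEMPLATE = "{prompt}\n\n{privileged_context}"


def build_teacher_prompt(prompt, privileged_context, template=DEFAULT_TEMPLATE):
    """Combine the student prompt and the teacher-only context (assumed str.format)."""
    return template.format(prompt=prompt, privileged_context=privileged_context)


def select_generation_prompt(student_prompt, teacher_prompt, generate_from_teacher=False):
    """Pick which prompt on-policy generation starts from."""
    return teacher_prompt if generate_from_teacher else student_prompt


student_prompt = "Solve 2+2."
teacher_prompt = build_teacher_prompt(student_prompt, "Example answer: 4.")

# A custom template only needs the same two placeholders.
custom_prompt = build_teacher_prompt(
    student_prompt,
    "Example answer: 4.",
    template="Hint: {privileged_context}\nTask: {prompt}",
)
```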
Expected dataset columns#
Each example must provide:
- `prompt`: the student-facing prompt
- `privileged_context`: only the extra teacher-only information, such as a demonstration, hint, or privileged feedback
Both standard text prompts and conversational prompts are supported by the trainer prompt handling.
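A quick pre-flight check along these lines can catch a missing column before training starts. This is only a sketch: `missing_sdft_columns` is a hypothetical helper, and the trainer's own validation may differ or be stricter.

```python
REQUIRED_COLUMNS = ("prompt", "privileged_context")


def missing_sdft_columns(example):
    """Return the required column names that are absent from one example."""
    return [name for name in REQUIRED_COLUMNS if name not in example]


# Standard text prompt.
text_example = {
    "prompt": "Solve 2+2.",
    "privileged_context": "Example answer: 4.",
}

# Conversational prompt, in the same shape as the usage example.
chat_example = {
    "prompt": [{"role": "user", "content": "Solve 2+2."}],
    "privileged_context": "Example answer: 4.",
}

# An example without the teacher-only column.
incomplete_example = {"prompt": "Solve 2+2."}
```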
Callbacks#
The trainer emits a small set of callback hooks that are useful for debugging, observability, and tests. These hooks are intended as practical integration points for experimental self-distillation workflows.
Shared self-distillation hooks:
- `on_self_distillation_batch_prepared`: fired when a self-distillation batch is ready. The payload includes `prompt_ids`, `completion_ids`, and `old_per_token_logps` when importance-sampling clipping inputs are available.
- `on_generation_batch_built`: fired when a new buffered generation batch is created. The payload includes `generate_every` and `steps_per_generation`.
SDFT-specific hook:
- `on_generation_prompts_selected`: fired when SDFT chooses the prompt source for on-policy generation. The payload includes the selected `generation_prompts` and the corresponding `generation_prompt_text`.
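A recorder along the following lines could be used to observe these hooks in a test. The hook names come from the list above, but the calling convention is an assumption: the sketch supposes the payload arrives as keyword arguments, which this page does not confirm. `HookRecorder` is a hypothetical plain class; in a real run it would most likely subclass `transformers.TrainerCallback` and be passed to the trainer through `callbacks`.

```python
class HookRecorder:
    """Record which hooks fired and which payload keys they carried."""

    def __init__(self):
        self.events = []

    def _record(self, name, payload):
        # Store only the key names so large tensors are not kept alive.
        self.events.append((name, sorted(payload)))

    # Assumed signature: positional trainer state, payload as keyword arguments.
    def on_self_distillation_batch_prepared(self, *args, **kwargs):
        self._record("on_self_distillation_batch_prepared", kwargs)

    def on_generation_batch_built(self, *args, **kwargs):
        self._record("on_generation_batch_built", kwargs)

    def on_generation_prompts_selected(self, *args, **kwargs):
        self._record("on_generation_prompts_selected", kwargs)


# Simulate one hook call by hand to show what gets recorded.
recorder = HookRecorder()
recorder.on_generation_prompts_selected(
    generation_prompts=["Solve 2+2."],
    generation_prompt_text=["Solve 2+2."],
)
```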
Example script#
Use trl/experimental/sdft/sdft.py to launch SDFT training from the command line. The script supports any causal LM from the Hub, custom local datasets via --dataset_path, and PEFT/LoRA via the standard ModelConfig flags.
```bash
python trl/experimental/sdft/sdft.py \
    --model_name_or_path Qwen/Qwen3.5-0.8B \
    --dataset_name your-org/your-dataset \
    --output_dir outputs/sdft-qwen3.5-0.8b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --max_prompt_length 1024 \
    --max_completion_length 512 \
    --generate_from_teacher \
    --sync_ref_model \
    --ref_model_sync_steps 1 \
    --ref_model_mixup_alpha 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --report_to wandb
```

SDFTConfig[[trl.experimental.sdft.SDFTConfig]]#
trl.experimental.sdft.SDFTConfig[[trl.experimental.sdft.SDFTConfig]]#
Configuration class for SDFTTrainer.
This adapts the official SDFT implementation to the TRL trainer API while reusing the common self-distillation configuration shared with SDPO.
Parameters:
disable_dropout (bool, optional, defaults to True) : Whether to disable dropout in the student and teacher models.
generate_from_teacher (bool, optional, defaults to False) : Whether on-policy generation should use the teacher-conditioned prompt instead of the student prompt.
teacher_prompt_template (str, optional, defaults to "{prompt}\n\n{privileged_context}") : Template used to combine the student prompt and privileged context into the teacher prompt.
num_loss_tokens_to_skip (int, optional, defaults to 0) : Number of initial completion tokens to exclude from the distillation loss.
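The effect of `num_loss_tokens_to_skip` can be pictured as zeroing the first few positions of a per-token loss mask. The sketch below works on a plain Python list for a single completion. `skip_initial_loss_tokens` is a hypothetical helper, and the trainer's actual masking (batching, padding, tensor types) is not shown on this page.

```python
def skip_initial_loss_tokens(completion_mask, num_loss_tokens_to_skip=0):
    """Return a copy of the mask with the first N completion positions zeroed."""
    masked = list(completion_mask)
    for i in range(min(num_loss_tokens_to_skip, len(masked))):
        masked[i] = 0
    return masked


# 1 marks a real completion token, 0 marks padding.
completion_mask = [1, 1, 1, 1, 0]

# Exclude the first two completion tokens from the distillation loss.
loss_mask = skip_initial_loss_tokens(completion_mask, num_loss_tokens_to_skip=2)
```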
SDFTTrainer[[trl.experimental.sdft.SDFTTrainer]]#
trl.experimental.sdft.SDFTTrainer[[trl.experimental.sdft.SDFTTrainer]]#
Trainer for SDFT-style on-policy self-distillation with explicit teacher prompts.
train[[trl.experimental.sdft.SDFTTrainer.train]]#
train(resume_from_checkpoint: str | bool | None = None, trial: optuna.Trial | dict[str, Any] | None = None, ignore_keys_for_eval: list[str] | None = None)
Main training entry point.
Parameters:
resume_from_checkpoint (str or bool, optional) : If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
trial (optuna.Trial or dict[str, Any], optional) : The trial run or the hyperparameter dictionary for hyperparameter search.
ignore_keys_for_eval (list[str], optional) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.
Returns:
~trainer_utils.TrainOutput
Object containing the global step count, training loss, and metrics.
save_model[[trl.experimental.sdft.SDFTTrainer.save_model]]#
Will save the model, so you can reload it using from_pretrained().
Will only save from the main process.
push_to_hub[[trl.experimental.sdft.SDFTTrainer.push_to_hub]]#
Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.
Parameters:
commit_message (str, optional, defaults to "End of training") : Message to commit while pushing.
blocking (bool, optional, defaults to True) : Whether the function should return only when the git push has finished.
token (str, optional, defaults to None) : Token with write permission to overwrite Trainer’s original args.
revision (str, optional) : The git revision to commit from. Defaults to the head of the “main” branch.
kwargs (dict[str, Any], optional) : Additional keyword arguments passed along to ~Trainer.create_model_card.
Returns:
The URL of the repository where the model was pushed if blocking=True, or a Future object tracking the
progress of the commit if blocking=False.