Generalized Knowledge Distillation Trainer

Overview#

Generalized Knowledge Distillation (GKD) was proposed in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem.

The abstract from the paper is the following:

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher’s distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.

The key aspects of GKD are:

  1. It addresses the train-inference distribution mismatch in auto-regressive sequence models by training the student model on its self-generated output sequences.
  2. GKD allows flexibility in choosing different divergence measures between student and teacher models via the generalized Jensen-Shannon Divergence (JSD), which can be useful when the student lacks the capacity to fully mimic the teacher.

This post-training method was contributed by Kashif Rasul and Lewis Tunstall.

Usage tips#

The experimental.gkd.GKDTrainer is a wrapper around the SFTTrainer class that takes an additional teacher model argument. Three parameters, set via experimental.gkd.GKDConfig, control its behaviour:

  • lmbda: controls the student data fraction, i.e., the proportion of on-policy student-generated outputs. When lmbda=0.0, the loss reduces to supervised JSD, where the student is trained with the token-level probabilities of the teacher. When lmbda=1.0, the loss reduces to on-policy JSD, where the student generates output sequences and receives token-specific feedback on these sequences from the teacher. For values between 0 and 1, each batch is randomly assigned to one of the two modes, using student-generated outputs with probability lmbda.
  • seq_kd: controls whether to perform Sequence-Level KD (can be viewed as supervised FT on teacher-generated output). When seq_kd=True and lmbda=0.0, the loss reduces to supervised JSD, where the teacher generates output sequences and the student receives token-specific feedback on these sequences from the teacher.
  • beta: controls the interpolation in the generalized Jensen-Shannon Divergence. When beta=0.0 the loss approximates forward KL divergence, while for beta=1.0 the loss approximates reverse KL divergence. For values between 0 and 1, it interpolates between the two.
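
The interaction between lmbda and seq_kd described above can be sketched in plain Python. This is only an illustration of the per-batch choice, not the trainer's actual code: the helper name choose_data_source and its return labels are hypothetical, and the exact comparison used internally is an assumption.

```python
import random


def choose_data_source(lmbda, seq_kd, rng=random):
    """Pick where a batch's output sequences come from (hypothetical helper)."""
    # With probability lmbda, train on on-policy student generations.
    if rng.random() < lmbda:
        return "student"
    # Otherwise use teacher generations (sequence-level KD) or the fixed dataset.
    return "teacher" if seq_kd else "dataset"


# Estimate the share of on-policy batches for lmbda=0.5 with a seeded generator.
rng = random.Random(0)
sources = [choose_data_source(0.5, False, rng) for _ in range(1000)]
student_fraction = sources.count("student") / len(sources)
```

Under these assumptions, lmbda=1.0 always selects student generations, lmbda=0.0 never does, and intermediate values mix the two sources batch by batch.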

The authors find that on-policy data (high lmbda) performs better and that the optimal beta varies depending on the task and evaluation method.
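
To make the role of beta more concrete, here is a minimal sketch of a generalized Jensen-Shannon Divergence for a single token position, written in plain Python. It assumes the mixture-based definition from the GKD paper and special-cases the two endpoints as forward and reverse KL; the function names are illustrative and this is not the trainer's implementation.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(teacher, student, beta=0.5):
    """Illustrative generalized JSD between teacher and student distributions."""
    if beta == 0.0:
        return kl(teacher, student)  # forward KL
    if beta == 1.0:
        return kl(student, teacher)  # reverse KL
    # Mixture of the two distributions, weighted by beta.
    mixture = [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1 - beta) * kl(student, mixture)


teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]
jsd_half = generalized_jsd(teacher_probs, student_probs, beta=0.5)
```

With beta=0.5 the measure is symmetric in its two arguments, and it is zero whenever the student matches the teacher exactly.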

When training Gemma models, make sure to set attn_implementation="kernels-community/flash-attn2". Otherwise you will encounter NaNs in the logits due to the soft capping technique adopted by this architecture.

The basic API is as follows:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl.experimental.gkd import GKDConfig, GKDTrainer

NUM_DUMMY_SAMPLES = 100

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The model to optimise
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The teacher model to calculate the KL divergence against
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

train_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "Hi, how are you?"},
                {"role": "assistant", "content": "I'm great thanks"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)
eval_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "What colour is the sky?"},
                {"role": "assistant", "content": "The sky is blue"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)

training_args = GKDConfig(output_dir="gkd-model", per_device_train_batch_size=1)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

Expected dataset type#

Each example in the dataset should contain a “messages” field holding a conversation: a list of dictionaries, one per message, with the following keys:

  • role: either system, assistant or user
  • content: the message content
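
As a quick sanity check before training, the format above can be verified with a few lines of plain Python. The helper is_valid_example is hypothetical and not part of TRL; it also assumes that a conversation must be non-empty and that content is a string, which the description above does not state explicitly.

```python
ALLOWED_ROLES = {"system", "assistant", "user"}


def is_valid_example(example):
    """Return True if the example has a well-formed "messages" list (hypothetical helper)."""
    messages = example.get("messages")
    # Assumption: an empty conversation is treated as invalid.
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(message, dict)
        and message.get("role") in ALLOWED_ROLES
        and isinstance(message.get("content"), str)
        for message in messages
    )


good_example = {
    "messages": [
        {"role": "user", "content": "Hi, how are you?"},
        {"role": "assistant", "content": "I'm great thanks"},
    ]
}
bad_example = {"messages": [{"role": "bot", "content": "Hello"}]}
```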

GKDTrainer#

trl.experimental.gkd.GKDTrainer#

Trainer for Generalized Knowledge Distillation (GKD) of language models.

For details on GKD, see the paper: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes.

Parameters:

model (PreTrainedModel or torch.nn.Module or str, optional) : Model to be trained, or the string identifier of the model to be instantiated from a pretrained model.

teacher_model (PreTrainedModel or torch.nn.Module or str, optional) : Teacher model for knowledge distillation, or the string identifier of the model to be instantiated from a pretrained model.

args (experimental.gkd.GKDConfig, optional) : Training arguments.

data_collator (DataCollator, optional) : Data collator to batch samples from the dataset. It defaults to an experimental.utils.DataCollatorForChatML using the processing_class.

train_dataset (Dataset, optional) : Dataset for training.

eval_dataset (Dataset or dict of Dataset, optional) : Dataset for evaluation.

processing_class (PreTrainedTokenizerBase, BaseImageProcessor, FeatureExtractionMixin or ProcessorMixin, optional) : Class to process the data.

compute_metrics (Callable, optional) : Function to compute metrics at evaluation. Must take in an EvalPrediction and return a dictionary mapping strings to floats.

callbacks (list of TrainerCallback, optional) : Callbacks to use during training.

optimizers (tuple of torch.optim.Optimizer and torch.optim.lr_scheduler.LambdaLR, optional, defaults to (None, None)) : Tuple containing the optimizer and the learning rate scheduler to use for training.

preprocess_logits_for_metrics (Callable, optional) : Function to preprocess the logits before computing the metrics. Must take in the logits and labels and return the logits to be used for metrics computation.

peft_config (PeftConfig, optional) : PEFT configuration to use PEFT for training. If None, PEFT is not used. If provided, the model will be wrapped with the specified PEFT adapter.

formatting_func (Callable, optional) : Function to format the dataset. Must take in an example and return an example.

train#

Main training entry point.

Parameters:

resume_from_checkpoint (str or bool, optional) : If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (optuna.Trial or dict[str, Any], optional) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (list[str], optional) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.

Returns:

trainer_utils.TrainOutput

Object containing the global step count, training loss, and metrics.

save_model#

Will save the model, so you can reload it using from_pretrained().

Will only save from the main process.

push_to_hub#

Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.

Parameters:

commit_message (str, optional, defaults to "End of training") : Message to commit while pushing.

blocking (bool, optional, defaults to True) : Whether the function should return only when the git push has finished.

token (str, optional, defaults to None) : Token with write permission to overwrite Trainer’s original args.

revision (str, optional) : The git revision to commit from. Defaults to the head of the “main” branch.

kwargs (dict[str, Any], optional) : Additional keyword arguments passed along to Trainer.create_model_card.

Returns:

The URL of the repository where the model was pushed if blocking=True, or a Future object tracking the progress of the commit if blocking=False.

GKDConfig#

trl.experimental.gkd.GKDConfig#

Configuration class for experimental.gkd.GKDTrainer.

This class includes only the parameters that are specific to GKD training. For a full list of training arguments, please refer to the TrainingArguments and SFTConfig documentation.

Parameters:

temperature (float, optional, defaults to 0.9) : Temperature for sampling. The higher the temperature, the more random the completions.

lmbda (float, optional, defaults to 0.5) : Lambda parameter that controls the student data fraction (i.e., the proportion of on-policy student-generated outputs).

beta (float, optional, defaults to 0.5) : Interpolation coefficient between 0.0 and 1.0 of the Generalized Jensen-Shannon Divergence loss. When beta is 0.0, the loss is the forward KL divergence. When beta is 1.0, the loss is the reverse KL divergence.

max_new_tokens (int, optional, defaults to 128) : Maximum number of tokens to generate per completion.

teacher_model_name_or_path (str, optional) : Model name or path of the teacher model. If None, the teacher model will be the same as the model being trained.

teacher_model_init_kwargs (dict[str, Any], optional) : Keyword arguments to pass to AutoModelForCausalLM.from_pretrained when instantiating the teacher model from a string.

disable_dropout (bool, optional, defaults to True) : Whether to disable dropout in the model.

seq_kd (bool, optional, defaults to False) : Seq_kd parameter that controls whether to perform Sequence-Level KD (can be viewed as supervised FT on teacher-generated output).
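
The effect of the temperature parameter listed above can be illustrated with standard temperature-scaled softmax in plain Python. This is a generic sketch of how sampling temperature is commonly applied, offered as an assumption rather than a description of the trainer's generation code; the function name is illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=0.9):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

A lower temperature concentrates probability on the highest-scoring token, while a higher temperature spreads it more evenly, which is why completions become more random as the temperature rises.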

Link last verified June 7, 2026.
Source: TRL Docs