Model distillation transfers knowledge from a larger, more powerful CLIP model (the teacher) into a smaller, more efficient model (the student). This lets you create compact models that retain much of the teacher's performance while running faster and using less memory.

Overview

Distillation in OpenCLIP uses the teacher model’s embeddings as soft targets to guide the training of the student model. The student learns both to mimic the teacher’s predictions and to maintain contrastive alignment between images and text.
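The soft-target idea can be illustrated with a minimal, framework-free sketch (plain Python, not OpenCLIP's actual implementation): the teacher's similarity logits are softened into a probability distribution, and the student is penalized by the cross-entropy between its own distribution and the teacher's.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_probs):
    # cross-entropy of the student's distribution against the
    # teacher's soft targets (lower = closer to the teacher)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher_logits = [4.0, 1.0, 0.5]
teacher_probs = softmax(teacher_logits)

# A student that reproduces the teacher's logits incurs the minimum
# possible loss (the teacher distribution's entropy); a student that
# disagrees with the teacher scores higher.
matched = soft_cross_entropy([4.0, 1.0, 0.5], teacher_probs)
mismatched = soft_cross_entropy([0.5, 1.0, 4.0], teacher_probs)
print(matched < mismatched)  # True
```

In real training this soft-target term is added to the student's own contrastive loss, so the student is pulled toward the teacher and toward correct image-text alignment at the same time.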

Basic Usage

To enable distillation, specify the teacher model using --distill-model and --distill-pretrained flags:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --lr 5e-4 \
    --epochs 32

Distillation Parameters

Required Parameters

  • --distill-model: Architecture of the teacher model (e.g., ViT-L-14, ViT-H-14)
  • --distill-pretrained: Pre-trained weights for the teacher model (e.g., openai, laion2b_s32b_b82k)

How It Works

  1. Teacher Model: A large, pre-trained model is loaded and frozen
  2. Student Model: Your target model is trained normally
  3. Distillation Loss: The student learns to match the teacher’s embeddings in addition to the standard contrastive loss
  4. Combined Training: Both losses are used to guide the student’s learning
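The four steps above can be caricatured with a toy, framework-free sketch (plain Python, not the actual PyTorch training loop): the teacher is a fixed function that is never updated, and only the student's weight takes gradient steps on the distillation loss.

```python
def teacher_embed(x):
    # frozen teacher: parameters are fixed, never updated
    return 2.0 * x

def student_embed(x, w):
    # student with a single trainable weight w
    return w * x

def distill_loss(x, w):
    # squared distance between student and teacher embeddings
    return (student_embed(x, w) - teacher_embed(x)) ** 2

def train_step(x, w, lr=0.1):
    # numeric gradient of the loss w.r.t. the student weight only;
    # the teacher receives no updates
    eps = 1e-5
    grad = (distill_loss(x, w + eps) - distill_loss(x, w - eps)) / (2 * eps)
    return w - lr * grad

w = 0.0
for _ in range(200):
    w = train_step(1.0, w)
print(round(w, 3))  # → 2.0 (the student converges to the teacher's behavior)
```

The real loop works the same way at a much larger scale: the teacher runs a forward pass with gradients disabled, and only the student's parameters are optimized against the combined contrastive-plus-distillation loss.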

Example: Distilling from OpenAI ViT-L/14

One of the most common use cases is distilling from OpenAI’s large ViT-L/14 model:
python -m open_clip_train.main \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 8 \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --warmup 2000 \
    --lr 5e-4 \
    --wd 0.1 \
    --epochs 32 \
    --imagenet-val /data/imagenet/validation/ \
    --report-to wandb

Distillation from Different Teacher Models

From LAION Models

Distill from a larger LAION-trained model:
python -m open_clip_train.main \
    --model ViT-B-16 \
    --distill-model ViT-L-14 \
    --distill-pretrained laion2b_s32b_b82k \
    --train-data "/data/train.tar" \
    --batch-size 256 \
    --epochs 32

From DataComp Models

Distill from state-of-the-art DataComp models:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained datacomp_xl_s13b_b90k \
    --train-data "/data/datacomp.tar" \
    --batch-size 256 \
    --epochs 32

From SigLIP Models

Distill from efficient SigLIP models:
python -m open_clip_train.main \
    --model ViT-B-16 \
    --distill-model ViT-SO400M-14-SigLIP \
    --distill-pretrained webli \
    --siglip \
    --train-data "/data/train.tar" \
    --batch-size 256 \
    --epochs 32

Architecture Combinations

You can distill between different architecture families:

ViT to ViT (Different Sizes)

# Large to Base
--model ViT-B-32 --distill-model ViT-L-14 --distill-pretrained openai

# Huge to Large
--model ViT-L-14 --distill-model ViT-H-14 --distill-pretrained laion2b_s32b_b79k

# Base to Small
--model ViT-S-16 --distill-model ViT-B-16 --distill-pretrained openai

ConvNet to ViT

--model convnext_base --distill-model ViT-L-14 --distill-pretrained openai

ViT to ConvNet

--model ViT-B-32 --distill-model convnext_large_d_320 --distill-pretrained laion2b_s29b_b131k_ft

Distillation Loss

The distillation loss in OpenCLIP combines:
  1. Standard Contrastive Loss: Image-text alignment for the student model
  2. Distillation Loss: Student embeddings match teacher embeddings
The DistillClipLoss class computes both objectives:
from open_clip.loss import DistillClipLoss

loss = DistillClipLoss(
    local_loss=args.local_loss,              # compute the loss only against local features
    gather_with_grad=args.gather_with_grad,  # all-gather features with gradients attached
    cache_labels=True,                       # cache contrastive target labels between steps
    rank=args.rank,
    world_size=args.world_size,
)

Important Constraints

Gradient Accumulation

Distillation currently requires --accum-freq 1 (no gradient accumulation):
# This is required for distillation
--accum-freq 1
If you need to simulate larger batches, increase --batch-size or use more GPUs instead.

Performance Considerations

Memory Usage

  • The teacher model is kept in memory (frozen) during training
  • Ensure you have enough GPU memory for both student and teacher models
  • The teacher does not require gradient storage, which saves memory
  • Use --precision amp or --precision fp16 to reduce memory usage
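As a back-of-envelope check (a rough sketch, not a measurement; the parameter counts below are approximate totals for the combined image and text towers), you can estimate the extra weight memory a frozen teacher adds:

```python
# Rough GPU-memory estimate for distillation (illustrative only).
# Parameter counts are approximate totals for the full CLIP models.
PARAMS = {"ViT-B-32": 151e6, "ViT-L-14": 428e6}

def gigabytes(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

# Frozen teacher: weights only (fp16 under amp), no gradients, no optimizer state.
teacher_gb = gigabytes(PARAMS["ViT-L-14"], 2)

# Student additionally needs gradients and optimizer state; with AdamW that is
# roughly fp16 weights + fp16 grads + two fp32 moment buffers per parameter.
student_gb = gigabytes(PARAMS["ViT-B-32"], 2 + 2 + 4 + 4)

print(f"teacher ~{teacher_gb:.1f} GB, student ~{student_gb:.1f} GB of weights/state")
```

Activation memory, which scales with batch size, comes on top of this for both models, so treat these numbers as a floor rather than a total.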

Training Speed

  • Distillation adds overhead from teacher forward passes
  • Expect ~1.5-2x slower training compared to non-distilled training
  • The teacher uses torch.no_grad() context to avoid gradient computation
  • Use mixed precision training to improve speed

Advanced Configuration

With Distributed Training

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --train-data "/data/train.tar" \
    --batch-size 256 \
    --precision amp \
    --workers 8 \
    --local-loss \
    --gather-with-grad

With Mixed Precision

python -m open_clip_train.main \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --precision amp \
    --train-data "/data/train.tar" \
    --batch-size 512

With Zero-Shot Evaluation

python -m open_clip_train.main \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --train-data "/data/train.tar" \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1 \
    --batch-size 256

Monitoring Distillation

Key metrics to track during distillation:
  • Student Contrastive Loss: How well the student aligns images and text
  • Distillation Loss: How closely student embeddings match teacher embeddings
  • Zero-shot Accuracy: Performance on downstream tasks
  • Training Speed: Samples per second compared to baseline
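A hypothetical minimal tracker (plain Python, not part of OpenCLIP or its logging) shows the bookkeeping behind these metrics: accumulate per-step losses and elapsed time, then report averages and samples per second.

```python
import time

class DistillMeter:
    """Toy metric tracker for distillation runs (illustrative only)."""

    def __init__(self):
        self.start = time.perf_counter()
        self.samples = 0
        self.losses = {"contrastive": [], "distill": []}

    def update(self, batch_size, contrastive, distill):
        # record one training step
        self.samples += batch_size
        self.losses["contrastive"].append(contrastive)
        self.losses["distill"].append(distill)

    def summary(self):
        elapsed = time.perf_counter() - self.start
        averages = {k: sum(v) / len(v) for k, v in self.losses.items()}
        return self.samples / elapsed, averages

meter = DistillMeter()
for step in range(3):  # stand-in values for real training steps
    meter.update(256, contrastive=2.0 - step * 0.1, distill=1.0 - step * 0.1)
throughput, avg = meter.summary()
print(f"{throughput:.0f} samples/s, avg losses: {avg}")
```

A steadily falling distillation loss alongside a falling contrastive loss is the expected pattern; a distillation loss that plateaus early while the contrastive loss keeps dropping can indicate the student has saturated its capacity to match the teacher.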

Best Practices

  1. Teacher Selection: Use the best available teacher model for your domain
  2. Learning Rate: Start with lower learning rates (5e-5 to 5e-4) for distillation
  3. Batch Size: Use the largest batch size your hardware allows
  4. Data Quality: Higher quality data leads to better distillation results
  5. Training Duration: Distillation often benefits from longer training
  6. Architecture Gap: Smaller gaps between teacher and student typically work better
  7. Evaluation: Regularly evaluate zero-shot performance during training

Example Workflow

#!/bin/bash
# Distillation training script

python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data "/data/cc12m/train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --warmup 2000 \
    --batch-size 256 \
    --lr 5e-4 \
    --wd 0.1 \
    --epochs 32 \
    --workers 8 \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --precision amp \
    --imagenet-val /data/imagenet/validation/ \
    --name "distill-b32-from-l14" \
    --logs ./logs

Troubleshooting

Out of Memory

  • Reduce --batch-size
  • Enable --precision amp or --precision fp16
  • Use gradient checkpointing: --grad-checkpointing
  • Choose a smaller teacher model

Poor Performance

  • Increase training duration (--epochs)
  • Adjust learning rate (try 1e-4 to 5e-4)
  • Ensure teacher model is properly loaded
  • Check data quality and preprocessing
  • Verify batch size is sufficient for contrastive learning

Slow Training

  • Enable mixed precision: --precision amp
  • Use more workers: --workers 8
  • Enable distributed features: --local-loss --gather-with-grad
  • Consider using a smaller teacher model
