Overview
Fine-tuning adapts a pretrained CLIP model to a specific domain or dataset by continuing training from a pretrained checkpoint. This is usually far cheaper than training from scratch and often reaches better performance with less data.

When to Fine-tune
✅ Fine-tune when:
- You have a pretrained model that’s close to your target domain
- You have limited training data (1M-100M samples)
- You want to adapt to a specific domain (medical images, satellite imagery, etc.)
- You need faster convergence than training from scratch
- You want to improve zero-shot performance on specific tasks
❌ Train from scratch when:
- Your domain is very different from the pretrained model’s training data
- You have massive amounts of training data (>1B samples)
- You need a completely custom architecture
- You want to experiment with new training objectives
Loading Pretrained Weights
From OpenCLIP Pretrained Models
Use the `--pretrained` flag with a model tag:
- `laion2b_s34b_b79k`: ViT-B/32 trained on LAION-2B
- `laion2b_s32b_b82k`: ViT-L/14 trained on LAION-2B
- `openai`: original OpenAI CLIP weights
- `datacomp_xl_s13b_b90k`: DataComp-1B models
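A launch might look like the sketch below. The `open_clip_train.main` entry point matches recent OpenCLIP releases (older versions expose the trainer as `training.main`), and the shard path is a placeholder:

```shell
# Fine-tune ViT-B/32 starting from LAION-2B pretrained weights.
# The train-data path is hypothetical; quote brace patterns so the
# shell does not expand them before webdataset sees them.
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --lr 1e-5 \
    --warmup 1000 \
    --epochs 10
```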
From Local Checkpoint
Pass the path to a local checkpoint file instead of a tag.

From Hugging Face Hub
Download the checkpoint from the Hugging Face Hub and point `--pretrained` at the local path.

Resuming Training from Checkpoint
The `--resume` flag continues training from a checkpoint, including optimizer state. It differs from `--pretrained` as follows:
| Flag | Use Case | Loads Optimizer | Loads Epoch | Learning Rate |
|---|---|---|---|---|
| `--resume` | Continue interrupted training | ✅ Yes | ✅ Yes | Original schedule continues |
| `--pretrained` | Fine-tune from pretrained | ❌ No | ❌ No | New schedule from epoch 0 |
Resume from Latest Checkpoint
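A sketch of resuming, assuming the `open_clip_train.main` entry point; the `--logs` and `--name` values are placeholders. Passing `latest` picks the newest checkpoint in the run's checkpoint directory:

```shell
# Resume the run named "my-finetune" from its most recent checkpoint
python -m open_clip_train.main \
    --resume latest \
    --logs /data/logs \
    --name my-finetune \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset
```

You can also pass an explicit checkpoint path to `--resume` (e.g. a specific `epoch_N.pt` file under the run's checkpoint directory) to restart from an earlier point.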
Fine-tuning Strategies
1. Full Model Fine-tuning
Fine-tune all parameters with a lower learning rate:

- ⬇️ Lower learning rate: `1e-5` vs `1e-3` for pretraining
- ⏱️ Fewer epochs: 10 vs 32 for pretraining
- 🔥 Shorter warmup: 1000 vs 10000 steps
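As a sketch (entry point and data path are assumptions, as above):

```shell
# Full fine-tuning: every parameter is trainable; reduce LR, warmup,
# and epoch count relative to pretraining
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 1e-5 \
    --warmup 1000 \
    --epochs 10 \
    --wd 0.1 \
    --precision amp
```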
2. Frozen Image Encoder (Text-Only Fine-tuning)
Freeze the image encoder and fine-tune only the text encoder:

- 💾 Lower memory usage
- ⚡ Faster training
- 🎯 Useful when adapting to new vocabulary/concepts
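OpenCLIP exposes this through the `--lock-image` flag; a sketch (data path hypothetical):

```shell
# --lock-image freezes the image tower; only the text tower trains
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 1e-4 \
    --epochs 15
```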
3. Frozen Text Encoder (Image-Only Fine-tuning)
Freeze the text encoder and fine-tune only the image encoder. Useful for:

- Adapting to new image domains (medical, satellite, etc.)
- Maintaining text understanding while improving visual features
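The mirror-image setup uses `--lock-text`; a sketch with placeholder paths:

```shell
# --lock-text freezes the text tower; only the image tower trains
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-text \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 1e-4 \
    --epochs 15
```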
4. Partial Fine-tuning
Freeze early layers and fine-tune only the later layers:

- ⚖️ Balance between adaptation and preservation
- 💾 Lower memory and compute requirements
- 🛡️ Less prone to overfitting on small datasets
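OpenCLIP's lock flags accept "unlocked" counts that leave the last N layer groups trainable. A sketch (the exact layer-group granularity depends on the model architecture; paths are placeholders):

```shell
# Lock both towers but leave the last 2 groups/layers of each trainable
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 2 \
    --lock-text \
    --lock-text-unlocked-layers 2 \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 5e-5 \
    --epochs 10
```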
5. LiT (Locked Image Tuning)
Lock the image encoder with ImageNet-pretrained weights and train the text encoder from scratch.

Learning Rate Adjustment

Fine-tuning requires careful learning rate selection.

Recommended Learning Rates
| Strategy | Learning Rate | Warmup Steps | Epochs |
|---|---|---|---|
| Full fine-tuning | 1e-5 to 1e-4 | 1000-5000 | 5-10 |
| Partial fine-tuning | 5e-5 to 5e-4 | 1000-3000 | 5-10 |
| Frozen encoder | 1e-4 to 1e-3 | 1000-5000 | 10-20 |
| From scratch (reference) | 1e-3 to 5e-3 | 10000 | 32+ |
Learning Rate Schedules
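With base learning rate $\eta_{\max}$, warmup length $W$ steps, and $T$ total steps, the standard formulation is linear warmup followed by cosine decay (OpenCLIP's implementation may differ in small details, e.g. a nonzero floor):

```latex
\eta(t) =
\begin{cases}
\eta_{\max}\,\dfrac{t}{W} & t < W \\[6pt]
\dfrac{\eta_{\max}}{2}\left(1 + \cos\!\left(\pi\,\dfrac{t - W}{T - W}\right)\right) & t \ge W
\end{cases}
```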
Cosine decay with linear warmup is the recommended schedule.

Fine-tuning Examples
Domain Adaptation: Medical Images
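A domain-adaptation sketch: start from a general-purpose checkpoint, use a low learning rate and few epochs, and track zero-shot ImageNet so you notice forgetting early. The medical-shard and ImageNet paths are hypothetical:

```shell
# Adapt a general checkpoint to medical images with a conservative LR
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/medical/{00000..00499}.tar" \
    --dataset-type webdataset \
    --lr 2e-5 \
    --warmup 2000 \
    --epochs 10 \
    --zeroshot-frequency 1 \
    --imagenet-val /data/imagenet/val
```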
Small Dataset Fine-tuning
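For small datasets, freezing the image tower and increasing weight decay helps prevent overfitting. A sketch using OpenCLIP's CSV dataset type; the file path and column names are hypothetical:

```shell
# Small-dataset fine-tuning: frozen image tower, CSV image/caption pairs
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --train-data /data/small/train.csv \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --lr 1e-4 \
    --epochs 20 \
    --wd 0.2
```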
Multilingual Fine-tuning
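One option is to start from a checkpoint with a multilingual text tower and fine-tune the text side only. The model/tag pair below is taken from OpenCLIP's pretrained list; verify it is available in your installed version, and treat the data path as a placeholder:

```shell
# Multilingual fine-tuning: XLM-RoBERTa text tower, frozen image tower
python -m open_clip_train.main \
    --model xlm-roberta-base-ViT-B-32 \
    --pretrained laion5b_s13b_b90k \
    --lock-image \
    --train-data "/data/multilingual/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 5e-5 \
    --epochs 10
```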
High-Resolution Fine-tuning
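A sketch using `--force-image-size` to fine-tune at a higher input resolution than pretraining; for ViT models OpenCLIP is expected to resample position embeddings to fit the new size (check your version's behavior), and the data path is hypothetical:

```shell
# Fine-tune at 336px input instead of the pretrained 224px
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --force-image-size 336 \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 1e-5 \
    --warmup 1000 \
    --epochs 5
```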
WiSE-FT: Robust Fine-tuning
For robust fine-tuning that maintains performance under distribution shift, use the WiSE-FT repository. WiSE-FT (weight-space ensembles for fine-tuning) averages the weights of:

- The zero-shot pretrained model
- The fine-tuned model
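The ensemble is a single weight-space interpolation with mixing coefficient $\alpha$, where $\alpha = 0$ recovers the zero-shot model and $\alpha = 1$ the fine-tuned one:

```latex
\theta_{\alpha} = (1 - \alpha)\,\theta_{\text{zero-shot}} + \alpha\,\theta_{\text{fine-tuned}}, \qquad \alpha \in [0, 1]
```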
WiSE-FT Workflow
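A rough sketch of the workflow using the mlfoundations/wise-ft repository; the script path, flag names, and checkpoint filenames follow that repo's README as I recall it and may differ by version, so treat them as assumptions:

```shell
# Interpolate a zero-shot and a fine-tuned checkpoint at alpha = 0.5
git clone https://github.com/mlfoundations/wise-ft
cd wise-ft
python src/wise_ft.py \
    --load zeroshot.pt,finetuned.pt \
    --alpha 0.5 \
    --save models/wise-ft
```

In practice, sweep several `--alpha` values and pick the one with the best trade-off between fine-tuned-task and zero-shot accuracy.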
Monitoring Fine-tuning
Zero-shot Evaluation
Track both metrics during fine-tuning:

- Fine-tuning dataset performance (should improve)
- Zero-shot ImageNet accuracy (may degrade if overfitting)
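OpenCLIP can run the zero-shot ImageNet evaluation during training; add these flags to your training command (the ImageNet validation path is a placeholder):

```shell
# Evaluate zero-shot ImageNet accuracy every epoch
    --zeroshot-frequency 1 \
    --imagenet-val /data/imagenet/val
```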
Validation Loss
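To track validation loss, point the trainer at held-out shards; a flag fragment with a hypothetical path:

```shell
# Compute contrastive loss on held-out shards every epoch
    --val-data "/data/val/{00000..00007}.tar" \
    --val-frequency 1
```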
Weights & Biases Logging
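Metrics can be streamed to Weights & Biases via `--report-to`; this fragment assumes `wandb` is installed and your account is configured, and the project name is a placeholder:

```shell
# Log losses, learning rate, and eval metrics to W&B
    --report-to wandb \
    --wandb-project-name clip-finetune
```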
Common Fine-tuning Issues
Overfitting
Symptoms:

- Training loss decreases while validation loss increases
- Zero-shot performance degrades significantly

Fixes:

- Reduce the learning rate
- Use fewer epochs
- Freeze more layers
- Add regularization (increase `--wd`)
- Use more data augmentation
Underfitting
Symptoms:

- Both training and validation loss remain high
- No improvement over the pretrained model

Fixes:

- Increase the learning rate
- Train for more epochs
- Unfreeze more layers
- Reduce regularization
Catastrophic Forgetting
Symptoms:

- Good performance on the fine-tuning dataset
- Poor zero-shot performance on general tasks

Fixes:

- Use a lower learning rate
- Freeze early layers
- Use WiSE-FT weight ensembling
- Mix fine-tuning data with general data
Best Practices
Fine-tuning checklist:
- ✅ Start with a pretrained model close to your domain
- ✅ Use 10-100× lower learning rate than pretraining
- ✅ Fine-tune for 5-20 epochs (much less than pretraining)
- ✅ Monitor both task performance and zero-shot performance
- ✅ Try partial fine-tuning before full fine-tuning
- ✅ Use validation set to prevent overfitting
- ✅ Consider WiSE-FT for robust fine-tuning
- ✅ Save checkpoints frequently for comparison
Fine-tuning Templates
Quick Fine-tuning (Small Dataset)
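A quick-start sketch for a small dataset: frozen image tower, CSV data, short schedule. Entry point, paths, and column names are assumptions:

```shell
# Quick fine-tuning: small CSV dataset, frozen image tower
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --train-data /data/train.csv \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --batch-size 128 \
    --lr 1e-4 \
    --warmup 1000 \
    --epochs 10 \
    --wd 0.1
```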
Production Fine-tuning (Large Dataset)
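A production-scale sketch: multi-GPU launch via `torchrun`, webdataset shards, full fine-tuning with checkpointing, zero-shot monitoring, and W&B logging. GPU count, sample count, and all paths are placeholders:

```shell
# Production fine-tuning: 8 GPUs, ViT-L/14, full model trainable
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --train-data "/data/shards/{00000..09999}.tar" \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --lr 5e-5 \
    --warmup 2000 \
    --epochs 10 \
    --precision amp \
    --workers 8 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --imagenet-val /data/imagenet/val \
    --report-to wandb
```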
Conservative Fine-tuning (Preserve Generalization)
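A conservative sketch that locks most of both towers and uses a tiny learning rate to preserve zero-shot generalization; combine with WiSE-FT afterward for extra robustness. Paths are placeholders:

```shell
# Conservative fine-tuning: only the last group/layer of each tower trains
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --lock-text \
    --lock-text-unlocked-layers 1 \
    --train-data "/data/shards/{00000..00099}.tar" \
    --dataset-type webdataset \
    --lr 1e-5 \
    --warmup 500 \
    --epochs 5
```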
Next Steps
Training Overview
Learn about training CLIP models from scratch
Configuration
Explore all fine-tuning parameters
Pretrained Models
Browse available pretrained models
WiSE-FT
Learn about robust fine-tuning with weight ensembling
