Overview
OpenCLIP provides extensive configuration options for training CLIP models. This page documents the important training flags and hyperparameters defined in params.py.
To see all available options:
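A quick sketch (the module path is an assumption: recent open_clip releases ship the trainer as open_clip_train.main, while older versions used training.main):

```shell
# Print every training flag defined in params.py.
# Older releases: python -m training.main --help
python -m open_clip_train.main --help
```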
Data Configuration
Training Data
Path to training data. For WebDataset, use glob patterns like /data/train-{0000..2175}.tar. Multiple sources can be combined with ::.
Path to validation data (same format as train-data).
Total number of samples in training dataset. Required for WebDataset.
Number of samples in validation dataset.
Dataset format: webdataset, csv, synthetic, or auto (auto-detect).
Enable sampling with replacement for webdataset. Recommended for large datasets and multiple data sources.
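A minimal WebDataset invocation sketch (shard glob, sample count, and model choice are placeholders):

```shell
# Hypothetical shard glob; the sample count must be supplied because
# WebDataset cannot cheaply count samples up front.
python -m open_clip_train.main \
    --train-data '/data/train-{0000..2175}.tar' \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --model ViT-B-32 \
    --batch-size 256
```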
CSV Data Parameters
Column separator for CSV files (tab by default).
Column name for image paths in CSV.
Column name for captions in CSV.
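A CSV-based sketch, assuming a tab-separated file with hypothetical filepath and title columns:

```shell
# The tab separator is the default; it is passed explicitly here only for clarity.
python -m open_clip_train.main \
    --train-data /data/train.csv \
    --dataset-type csv \
    --csv-separator $'\t' \
    --csv-img-key filepath \
    --csv-caption-key title \
    --model RN50
```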
Data Upsampling
Upsampling factors for multiple data sources, separated by ::. Controls relative sampling probability.
Model Configuration
Model Selection
Model architecture to train. See Model Architectures for all options.
Load pretrained weights. Can be a tag (e.g., laion2b_s34b_b79k) or a local path.
Load ImageNet pretrained weights for the image encoder (if available).
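For example, starting from a pretrained tag (laion2b_s34b_b79k is a real LAION-2B checkpoint tag for ViT-B-32; the data flags are placeholders):

```shell
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data /data/train.csv \
    --dataset-type csv
```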
Model Modifications
Override default image input size.
Override default text context length.
Override patch dropout probability for ViT models. Use 0.5-0.75 for 2-3x speedup.
Force QuickGELU activation (for compatibility with older checkpoints).
Force separate text tower (CustomTextCLIP architecture).
Training Hyperparameters
Batch Size and Epochs
Batch size per GPU. Total batch size = batch_size × num_gpus × accum_freq.
Number of training epochs.
Gradient accumulation frequency. Simulates larger batch sizes.
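A worked example of the effective-batch-size formula above (numbers are illustrative):

```shell
# Effective (global) batch size = batch_size x num_gpus x accum_freq.
batch_size=256
num_gpus=8
accum_freq=4
global_batch=$((batch_size * num_gpus * accum_freq))
echo "$global_batch"   # 8192
```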
Learning Rate
Learning rate. Default: 5e-4 for both ViT and ResNet models.
Number of warmup steps (linear warmup from 0 to lr).
Learning rate schedule: cosine, const, or const-cooldown.
Number of cooldown epochs for const-cooldown scheduler.
End learning rate for cooldown.
Power for polynomial cooldown (1.0 = linear).
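The cooldown flags combine like this (all values are illustrative):

```shell
# Constant LR for 30 epochs, then a 2-epoch linear cooldown to 1e-6.
python -m open_clip_train.main \
    --lr 5e-4 \
    --warmup 2000 \
    --lr-scheduler const-cooldown \
    --epochs 32 \
    --epochs-cooldown 2 \
    --lr-cooldown-end 1e-6 \
    --lr-cooldown-power 1.0
```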
Optimizer
Optimizer choice. Use adamw or timm/{optimizer} for timm optimizers.
Adam beta1 parameter. Default:
- ViT: 0.9
- ResNet: 0.9
Adam beta2 parameter. Default:
- ViT: 0.98
- ResNet: 0.999
Adam epsilon parameter. Default:
- ViT: 1e-6
- ResNet: 1e-8
Weight decay (L2 regularization).
Momentum for timm optimizers (SGD, etc.).
Gradient Clipping
Gradient clipping norm. Prevents gradient explosion.
Precision and Memory
Precision
Training precision: amp, amp_bf16, bf16, fp16, fp32.
Memory Optimization
Enable gradient checkpointing to reduce memory usage (slower training).
Calculate the contrastive loss with local features against gathered global features (reduces per-GPU memory for the logit matrix from O(n²) to O(n)).
Enable gradient flow through feature gathering (use with --local-loss).
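These memory options are typically enabled together for large batches (a sketch; the batch size is illustrative):

```shell
python -m open_clip_train.main \
    --batch-size 1024 \
    --precision amp \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad
```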
Data Loading
Number of data loading workers per GPU.
Image Preprocessing
Override image normalization mean (RGB).
Override image normalization std (RGB).
Image resize interpolation: bicubic, bilinear, or random.
Image resize mode: shortest, longest, or squash (inference only).
Data augmentation configuration (key-value pairs).
Model Locking (Transfer Learning)
Image Tower
Lock (freeze) entire image encoder.
Leave last N image tower layer groups unlocked.
Freeze BatchNorm running statistics in locked layers.
Text Tower
Lock (freeze) entire text encoder.
Leave last N text tower layers unlocked.
Freeze LayerNorm in locked text layers.
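A transfer-learning sketch that freezes most of a pretrained RN50 image tower (flag values are illustrative):

```shell
# Unlock only the last image layer group; keep BatchNorm statistics
# frozen in the locked portion (relevant for ResNet towers).
python -m open_clip_train.main \
    --model RN50 \
    --pretrained openai \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --lock-image-freeze-bn-stats
```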
Checkpointing and Logging
Checkpoints
Save checkpoint every N epochs.
Save most recent checkpoint as epoch_latest.pt.
Delete previous checkpoint after saving new one (saves disk space).
Resume training from checkpoint path or "latest".
Logging
Directory for logs and checkpoints.
Experiment name (defaults to auto-generated based on timestamp and config).
Logging backends: tensorboard, wandb, or tensorboard,wandb.
Log training metrics every N steps.
Weights & Biases
W&B project name.
Notes for W&B run.
Evaluation
Path to ImageNet validation set for zero-shot evaluation during training.
Path to ImageNet-v2 for additional zero-shot evaluation.
Run zero-shot evaluation every N epochs.
Run validation every N epochs.
CoCa-Specific Parameters
Weight for CoCa contrastive loss.
Weight for CoCa caption generation loss.
Distributed Training
URL for distributed training initialization.
Distributed backend: nccl (NVIDIA GPU), hccl (Ascend NPU), or gloo (CPU).
Use Horovod for distributed training.
Enable static graph optimization for DDP (PyTorch >= 1.11).
Use synchronized batch normalization across GPUs.
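A single-node, multi-GPU launch sketch using torchrun (open_clip reads the distributed environment variables torchrun sets; paths and sizes are placeholders):

```shell
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train-{0000..2175}.tar' \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256
```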
Advanced Options
Compilation
Compile model with torch.compile() (PyTorch >= 2.0).
TorchScript the model.
Trace model with torch.jit.trace (inference only).
Model Distillation
Teacher model architecture for distillation.
Teacher model pretrained weights.
Loss Configuration
Use SigLip (sigmoid) loss instead of standard CLIP loss.
Distributed loss implementation override.
Remote Syncing
Remote path to sync checkpoints (S3 bucket or filesystem).
Sync to remote every N seconds.
Protocol for remote sync: s3 or fsspec.
Experimental
Use bitsandbytes linear layers for int8 training (experimental).
Other
Random seed for reproducibility.
Device for training: cuda or cpu.
Override default cache directory for model/tokenizer downloads.
Enable debug logging.
Log on local master (each node) instead of global master only.
Copy entire codebase to log directory.
Example Configurations
Small-Scale Training (RN50 on CC3M)
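A single-GPU sketch, assuming a CSV-format CC3M download with hypothetical filepath/title columns (hyperparameters follow common small-scale settings and are not prescriptive):

```shell
python -m open_clip_train.main \
    --train-data /data/cc3m_train.csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --model RN50 \
    --batch-size 128 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --warmup 10000 \
    --workers 8 \
    --report-to tensorboard
```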
Medium-Scale Training (ViT-B/32 on CC12M)
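An 8-GPU WebDataset sketch (the shard range and sample count are placeholders for a CC12M download):

```shell
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/cc12m/{00000..01099}.tar' \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --model ViT-B-32 \
    --batch-size 256 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --warmup 2000 \
    --precision amp \
    --workers 8
```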
Large-Scale Training (ViT-L/14 on LAION-400M)
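A large-scale sketch combining the memory optimizations documented above (shard range, node count, and hyperparameters are illustrative, not reproduced from a specific run):

```shell
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/laion400m/{00000..41455}.tar' \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --model ViT-L-14 \
    --batch-size 384 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --warmup 2000 \
    --precision amp_bf16 \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --workers 8
```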
Recommended Settings by Model
ViT-B/32
ViT-L/14
RN50
Next Steps
Single-Node Training
Apply these configurations to single-node training
Distributed Training
Configure distributed training optimizations
Data Preparation
Configure data loading and preprocessing
Fine-tuning
Configure fine-tuning from pretrained models
