
Overview

OpenCLIP provides extensive configuration options for training CLIP models. This page documents the key training flags and hyperparameters from params.py. To see all available options:
python -m open_clip_train.main --help

Data Configuration

Training Data

--train-data
string
Path to training data. For WebDataset, use glob patterns like /data/train-{0000..2175}.tar. Multiple sources can be combined with ::.
--train-data "/data/cc12m/train-{0000..2175}.tar"
--train-data "/data/cc12m/train.tar::/data/laion/train.tar"  # Multiple sources
--val-data
string
Path to validation data (same format as train-data).
--val-data "/data/val.csv"
--train-num-samples
integer
Total number of samples in training dataset. Required for WebDataset.
--train-num-samples 10968539  # CC12M
--val-num-samples
integer
Number of samples in validation dataset.
--dataset-type
string
default:"auto"
Dataset format: webdataset, csv, synthetic, or auto (auto-detect).
--dataset-type webdataset
--dataset-resampled
boolean
Enable sampling with replacement for webdataset. Recommended for large datasets and multiple data sources.
--dataset-resampled

CSV Data Parameters

--csv-separator
string
default:"\t"
Column separator for CSV files (tab by default).
--csv-separator ","  # Use comma separator
--csv-img-key
string
default:"filepath"
Column name for image paths in CSV.
--csv-img-key filepath
--csv-caption-key
string
default:"title"
Column name for captions in CSV.
--csv-caption-key title

Data Upsampling

--train-data-upsampling-factors
string
Upsampling factors for multiple data sources, separated by ::. Controls relative sampling probability.
--train-data "/data/cc12m/train.tar::/data/cc3m/train.tar" \
--train-data-upsampling-factors "1::4"  # Sample CC3M 4x more frequently
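As a rough model of the resulting mix, normalizing the factors gives each source's relative sampling probability (an illustrative simplification; the actual webdataset resampling also interacts with shard layout and dataset sizes):

```python
# Relative sampling probability per source, assuming probability is
# proportional to its upsampling factor (illustrative simplification).
def sampling_probs(factors):
    total = sum(factors)
    return [f / total for f in factors]

# --train-data-upsampling-factors "1::4"  ->  first vs second source
print(sampling_probs([1, 4]))  # [0.2, 0.8]
```

With factors `1::4`, roughly four out of five sampled examples come from the second source.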

Model Configuration

Model Selection

--model
string
default:"RN50"
Model architecture to train. See Model Architectures for all options.
--model ViT-B-32
--model ViT-L-14
--model RN50
--model coca_ViT-L-14  # CoCa model
--pretrained
string
Load pretrained weights. Can be a tag (e.g., laion2b_s34b_b79k) or a local path.
--pretrained laion2b_s34b_b79k
--pretrained /path/to/checkpoint.pt
--pretrained-image
boolean
Load ImageNet pretrained weights for the image encoder (if available).
--pretrained-image

Model Modifications

--force-image-size
integer
Override default image input size.
--force-image-size 224
--force-image-size 336 336  # Different height/width
--force-context-length
integer
Override default text context length.
--force-context-length 77
--force-patch-dropout
float
Override patch dropout probability for ViT models. Use 0.5-0.75 for 2-3x speedup.
--force-patch-dropout 0.5  # 50% patch dropout
--force-patch-dropout 0.0  # Disable patch dropout (fine-tuning)
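The speedup comes from the ViT processing fewer patch tokens per image. A sketch of the token counts involved (assumes square patches and that the class token is always kept, as in the common patch-dropout scheme):

```python
# Patch tokens a ViT processes, before and after patch dropout.
def patch_tokens(image_size: int, patch_size: int, dropout: float):
    total = (image_size // patch_size) ** 2  # patches per image
    kept = round(total * (1.0 - dropout))    # tokens surviving dropout
    return total, kept

# ViT-L/14 at 224px with --force-patch-dropout 0.5
total, kept = patch_tokens(224, 14, 0.5)
print(total, kept)  # 256 128
```

Self-attention cost scales roughly quadratically with token count while MLP cost scales linearly, which is consistent with the 2-3x end-to-end speedup quoted above rather than a full 4x.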
--force-quick-gelu
boolean
Force QuickGELU activation (for compatibility with older checkpoints).
--force-custom-text
boolean
Force separate text tower (CustomTextCLIP architecture).

Training Hyperparameters

Batch Size and Epochs

--batch-size
integer
default:"64"
Batch size per GPU. Total batch size = batch_size × num_gpus × accum_freq.
--batch-size 256
--epochs
integer
default:"32"
Number of training epochs.
--epochs 32
--accum-freq
integer
default:"1"
Gradient accumulation frequency. Simulates larger batch sizes.
--accum-freq 4  # Effective batch = batch_size × 4
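The effective global batch size is the product of the three settings above; a quick sanity check (numbers are illustrative):

```python
# Effective global batch size = per-GPU batch × number of GPUs × accumulation steps.
def effective_batch_size(batch_size: int, num_gpus: int, accum_freq: int) -> int:
    return batch_size * num_gpus * accum_freq

# e.g. --batch-size 256 on 8 GPUs with --accum-freq 4
print(effective_batch_size(256, 8, 4))  # 8192
```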

Learning Rate

--lr
float
Learning rate. Defaults to 5e-4 for both ViT and ResNet models if unset.
--lr 1e-3
--lr 5e-4
--warmup
integer
default:"10000"
Number of warmup steps (linear warmup from 0 to lr).
--warmup 10000
--lr-scheduler
string
default:"cosine"
Learning rate schedule: cosine, const, or const-cooldown.
--lr-scheduler cosine
--lr-scheduler const  # Constant LR after warmup
--lr-scheduler const-cooldown  # Constant with cooldown
--epochs-cooldown
integer
Number of cooldown epochs for const-cooldown scheduler.
--lr-scheduler const-cooldown \
--epochs-cooldown 5
--lr-cooldown-end
float
default:"0.0"
End learning rate for cooldown.
--lr-cooldown-end 1e-6
--lr-cooldown-power
float
default:"1.0"
Power for polynomial cooldown (1.0 = linear).
--lr-cooldown-power 1.0
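The warmup-plus-cosine behavior described above can be sketched as follows (an illustrative reimplementation, not the exact open_clip_train scheduler code):

```python
import math

def cosine_lr(step: int, base_lr: float, warmup: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay toward 0."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 5e-4, 10_000, 100_000))       # 0.0 (start of warmup)
print(cosine_lr(10_000, 5e-4, 10_000, 100_000))  # 0.0005 (warmup complete)
```

The const scheduler simply holds base_lr after warmup, and const-cooldown appends a polynomial decay over the final --epochs-cooldown epochs.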

Optimizer

--opt
string
default:"adamw"
Optimizer choice. Use adamw or timm/{optimizer} for timm optimizers.
--opt adamw
--opt timm/sgd
--beta1
float
Adam beta1 parameter. Defaults to 0.9 for both ViT and ResNet models.
--beta1 0.9
--beta2
float
Adam beta2 parameter. Default:
  • ViT: 0.98
  • ResNet: 0.999
--beta2 0.98
--eps
float
Adam epsilon parameter. Default:
  • ViT: 1e-6
  • ResNet: 1e-8
--eps 1e-6
--wd
float
default:"0.2"
Weight decay (L2 regularization).
--wd 0.2
--wd 0.1
--momentum
float
Momentum for timm optimizers (SGD, etc.).
--momentum 0.9

Gradient Clipping

--grad-clip-norm
float
Gradient clipping norm. Prevents gradient explosion.
--grad-clip-norm 1.0
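Global-norm clipping rescales all gradients when their combined L2 norm exceeds the threshold (this is the rule PyTorch's clip_grad_norm_ applies across all parameters); a minimal sketch:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within bounds, unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_global_norm([6.0, 8.0], 5.0))  # [3.0, 4.0] (norm 10 -> 5)
```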

Precision and Memory

Precision

--precision
string
default:"amp"
Training precision: amp, amp_bf16, bf16, fp16, fp32.
--precision amp        # Automatic Mixed Precision (FP16) - Recommended
--precision amp_bf16   # AMP with BFloat16 (A100/H100)
--precision fp32       # Full precision (slow, baseline)

Memory Optimization

--grad-checkpointing
boolean
Enable gradient checkpointing to reduce memory usage (slower training).
--grad-checkpointing
--local-loss
boolean
Calculate the contrastive loss between local features and the gathered global features (reduces per-GPU memory from O(n²) to O(n)).
--local-loss
--gather-with-grad
boolean
Enable gradient flow through feature gathering (use with --local-loss).
--gather-with-grad
Always use --local-loss and --gather-with-grad together for multi-GPU training (8+ GPUs). See Distributed Training.
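To see the memory saving, compare the per-GPU logit matrix with and without --local-loss (illustrative numbers; assumes 8 GPUs with a per-GPU batch of 256):

```python
gpus, per_gpu_batch = 8, 256
global_batch = gpus * per_gpu_batch          # 2048 samples per step

full_logits = global_batch * global_batch    # full n×n similarity matrix per GPU
local_logits = per_gpu_batch * global_batch  # local features vs gathered global features

print(full_logits, local_logits, full_logits // local_logits)  # 4194304 524288 8
```

The per-GPU logit matrix shrinks by a factor equal to the number of GPUs, which is why the saving grows with cluster size.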

Data Loading

--workers
integer
default:"4"
Number of data loading workers per GPU.
--workers 8  # 8 workers per GPU
Recommended: 4-8 workers per GPU for optimal performance.

Image Preprocessing

--image-mean
float[]
Override image normalization mean (RGB).
--image-mean 0.485 0.456 0.406  # ImageNet statistics
--image-std
float[]
Override image normalization std (RGB).
--image-std 0.229 0.224 0.225  # ImageNet statistics
--image-interpolation
string
Image resize interpolation: bicubic, bilinear, or random.
--image-interpolation bicubic
--image-resize-mode
string
Image resize mode: shortest, longest, or squash (inference only).
--image-resize-mode shortest
--aug-cfg
key=value
Data augmentation configuration (key-value pairs).
--aug-cfg scale_range=0.08::1.0 ratio_range=0.75::1.33

Model Locking (Transfer Learning)

Image Tower

--lock-image
boolean
Lock (freeze) entire image encoder.
--lock-image
--lock-image-unlocked-groups
integer
default:"0"
Leave last N image tower layer groups unlocked.
--lock-image --lock-image-unlocked-groups 2  # Freeze all but last 2 groups
--lock-image-freeze-bn-stats
boolean
Freeze BatchNorm running statistics in locked layers.
--lock-image-freeze-bn-stats

Text Tower

--lock-text
boolean
Lock (freeze) entire text encoder.
--lock-text
--lock-text-unlocked-layers
integer
default:"0"
Leave last N text tower layers unlocked.
--lock-text --lock-text-unlocked-layers 10  # Train last 10 layers
--lock-text-freeze-layer-norm
boolean
Freeze LayerNorm in locked text layers.
--lock-text-freeze-layer-norm

Checkpointing and Logging

Checkpoints

--save-frequency
integer
default:"1"
Save checkpoint every N epochs.
--save-frequency 1  # Save every epoch
--save-frequency 5  # Save every 5 epochs
--save-most-recent
boolean
Save most recent checkpoint as epoch_latest.pt.
--save-most-recent
--delete-previous-checkpoint
boolean
Delete previous checkpoint after saving new one (saves disk space).
--delete-previous-checkpoint
--resume
string
Resume training from checkpoint path or "latest".
--resume /path/to/checkpoint.pt
--resume latest  # Resume from latest checkpoint

Logging

--logs
string
default:"./logs/"
Directory for logs and checkpoints.
--logs ./logs/
--name
string
Experiment name (defaults to an auto-generated name based on the timestamp and configuration).
--name "vit-b32-cc12m-experiment"
--report-to
string
Logging backends: tensorboard, wandb, or tensorboard,wandb.
--report-to tensorboard
--report-to wandb
--report-to tensorboard,wandb  # Both
--log-every-n-steps
integer
default:"100"
Log training metrics every N steps.
--log-every-n-steps 100

Weights & Biases

--wandb-project-name
string
default:"open-clip"
W&B project name.
--wandb-project-name "my-clip-experiments"
--wandb-notes
string
Notes for W&B run.
--wandb-notes "Testing new learning rate schedule"

Evaluation

--imagenet-val
string
Path to ImageNet validation set for zero-shot evaluation during training.
--imagenet-val /data/imagenet/validation/
--imagenet-v2
string
Path to ImageNet-v2 for additional zero-shot evaluation.
--imagenet-v2 /data/imagenet-v2/
--zeroshot-frequency
integer
default:"2"
Run zero-shot evaluation every N epochs.
--zeroshot-frequency 1  # Every epoch
--val-frequency
integer
default:"1"
Run validation every N epochs.
--val-frequency 1

CoCa-Specific Parameters

--coca-contrastive-loss-weight
float
default:"1.0"
Weight for CoCa contrastive loss.
--coca-contrastive-loss-weight 1.0
--coca-caption-loss-weight
float
default:"2.0"
Weight for CoCa caption generation loss.
--coca-caption-loss-weight 2.0
For CoCa fine-tuning on captioning only:
--coca-contrastive-loss-weight 0 \
--coca-caption-loss-weight 1
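CoCa's training objective is a weighted sum of the two losses; a sketch of how the weights above combine (the loss values are hypothetical, for illustration only):

```python
def coca_loss(contrastive, caption, w_contrastive=1.0, w_caption=2.0):
    # Total CoCa objective: weighted sum of contrastive and captioning losses.
    return w_contrastive * contrastive + w_caption * caption

print(coca_loss(0.5, 0.3))                                # default weights
print(coca_loss(0.5, 0.3, w_contrastive=0, w_caption=1))  # captioning-only fine-tune
```

Setting the contrastive weight to 0, as in the fine-tuning recipe above, makes the objective a pure captioning loss.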

Distributed Training

--dist-url
string
URL for distributed training initialization.
--dist-url tcp://localhost:12345
--dist-backend
string
Distributed backend: nccl (NVIDIA GPU), hccl (Ascend NPU), or gloo (CPU).
--dist-backend nccl  # Default for GPU
--horovod
boolean
Use Horovod for distributed training.
--horovod
--ddp-static-graph
boolean
Enable static graph optimization for DDP (PyTorch >= 1.11).
--ddp-static-graph
--use-bn-sync
boolean
Use synchronized batch normalization across GPUs.
--use-bn-sync

Advanced Options

Compilation

--torchcompile
boolean
Compile model with torch.compile() (PyTorch >= 2.0).
--torchcompile
--torchscript
boolean
TorchScript the model.
--torchscript
--trace
boolean
Trace model with torch.jit.trace (inference only).
--trace

Model Distillation

--distill-model
string
Teacher model architecture for distillation.
--distill-model ViT-L-14
--distill-pretrained
string
Teacher model pretrained weights.
--distill-pretrained openai

Loss Configuration

--siglip
boolean
Use SigLip (sigmoid) loss instead of standard CLIP loss.
--siglip
--loss-dist-impl
string
Distributed loss implementation override.
--loss-dist-impl custom

Remote Syncing

--remote-sync
string
Remote path to sync checkpoints (S3 bucket or filesystem).
--remote-sync s3://my-bucket/checkpoints
--remote-sync-frequency
integer
default:"300"
Sync to remote every N seconds.
--remote-sync-frequency 600  # Sync every 10 minutes
--remote-sync-protocol
string
default:"s3"
Protocol for remote sync: s3 or fsspec.
--remote-sync-protocol s3

Experimental

--use-bnb-linear
string
Use bitsandbytes linear layers for int8 training (experimental).
--use-bnb-linear SwitchBackLinearGlobal

Other

--seed
integer
default:"0"
Random seed for reproducibility.
--seed 42
--device
string
default:"cuda"
Device for training: cuda or cpu.
--device cuda
--cache-dir
string
Override default cache directory for model/tokenizer downloads.
--cache-dir /path/to/cache
--debug
boolean
Enable debug logging.
--debug
--log-local
boolean
Log on local master (each node) instead of global master only.
--log-local
--copy-codebase
boolean
Copy entire codebase to log directory.
--copy-codebase

Example Configurations

Small-Scale Training (RN50 on CC3M)

python -m open_clip_train.main \
    --train-data "/data/cc3m/train.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --model RN50 \
    --save-frequency 5 \
    --report-to tensorboard

Medium-Scale Training (ViT-B/32 on CC12M)

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb

Large-Scale Training (ViT-L/14 on LAION-400M)

srun python -u src/open_clip_train/main.py \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-L-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --local-loss \
    --gather-with-grad \
    --force-patch-dropout 0.5 \
    --report-to wandb \
    --remote-sync s3://bucket/checkpoints \
    --delete-previous-checkpoint

Recommended Per-Model Settings

ViT-B/32

--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--precision amp \
--batch-size 256       # 256-512 depending on GPU memory

ViT-L/14

--model ViT-L-14 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--precision amp \
--grad-checkpointing \
--force-patch-dropout 0.5 \
--batch-size 128       # 128-256 depending on GPU memory

RN50

--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8 \
--precision amp \
--batch-size 256       # 256-512 depending on GPU memory

Next Steps

Single-Node Training

Apply these configurations to single-node training

Distributed Training

Configure distributed training optimizations

Data Preparation

Configure data loading and preprocessing

Fine-tuning

Configure fine-tuning from pretrained models
