Overview
Single-node training covers experiments and medium-scale runs on one machine with one or more GPUs. OpenCLIP uses torchrun (PyTorch's distributed launcher) for efficient multi-GPU training on a single node.
Prerequisites
Single machine with 1 or more GPUs
CUDA-capable GPUs (recommended: V100, A100, or newer)
OpenCLIP installed with training dependencies
Training data prepared in CSV or WebDataset format
Basic Single-Node Training
Single GPU Training
For single GPU training, you can use the training script directly without torchrun:
python -m open_clip_train.main \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to tensorboard \
--train-data "/path/to/train_data.csv" \
--val-data "/path/to/validation_data.csv" \
--csv-img-key filepath \
--csv-caption-key title \
--imagenet-val /path/to/imagenet/root/val/ \
--warmup 10000 \
--batch-size 128 \
--lr 1e-3 \
--wd 0.1 \
--epochs 30 \
--workers 8 \
--model RN50
The --imagenet-val argument should point to the validation set of ImageNet for zero-shot evaluation, not the training set. The val folder should contain subfolders for each class.
Multi-GPU Training with torchrun
For training on multiple GPUs on a single node, use torchrun with the --nproc_per_node flag:
cd open_clip/src
torchrun --nproc_per_node 4 -m open_clip_train.main \
--train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
--train-num-samples 10968539 \
--dataset-type webdataset \
--batch-size 320 \
--precision amp \
--workers 4 \
--imagenet-val /data/imagenet/validation/ \
--warmup 10000 \
--lr 1e-3 \
--wd 0.1 \
--epochs 32 \
--model ViT-B-32 \
--report-to tensorboard
Key Parameters:
--nproc_per_node 4: Number of GPUs to use (4 GPUs in this example)
--batch-size 320: Per-GPU batch size (total batch size = 320 × 4 = 1280)
--workers 4: Number of data loading workers per GPU
--precision amp: Automatic Mixed Precision for faster training and lower memory usage
WebDataset Training Example
WebDataset format is recommended for datasets larger than 10M samples:
torchrun --nproc_per_node 8 -m open_clip_train.main \
--train-data '/data/laion400m/{00000..41455}.tar' \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 256 \
--precision amp \
--workers 6 \
--model ViT-B-32 \
--warmup 2000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--save-frequency 1 \
--report-to wandb
WebDataset-specific flags:
--dataset-type webdataset: Specify WebDataset format
--dataset-resampled: Enable sampling with replacement (recommended for large datasets)
--train-num-samples: Total number of samples in dataset
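Shard paths like `cc12m-train-{0000..2175}.tar` use brace notation, which WebDataset expands internally (the pattern is quoted so the shell does not expand it first). A quick way to sanity-check how many shards a pattern covers is to let bash expand it and count the words:

```shell
# Count how many shard files the brace pattern expands to; this should match
# the number of .tar files actually on disk (2176 for {0000..2175}).
echo /data/cc12m/cc12m-train-{0000..2175}.tar | wc -w
```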
Batch Size and Worker Configuration
Calculating Effective Batch Size
The effective batch size is:
Effective Batch Size = batch_size × num_gpus × accum_freq
Example:
--batch-size 256 (per GPU)
--nproc_per_node 4 (4 GPUs)
--accum-freq 1 (no gradient accumulation)
Effective batch size: 256 × 4 × 1 = 1024
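The arithmetic above can be captured in a tiny helper (illustrative only, not part of OpenCLIP):

```python
def effective_batch_size(batch_size: int, num_gpus: int, accum_freq: int = 1) -> int:
    """Global batch size seen by the optimizer per update step."""
    return batch_size * num_gpus * accum_freq

# 256 per GPU on 4 GPUs without accumulation -> 1024
print(effective_batch_size(256, 4, 1))  # 1024
```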
Optimizing Worker Count
The --workers parameter controls the number of data loading processes per GPU:
--workers 4 # 4 workers per GPU
Guidelines:
Start with 4-8 workers per GPU
Too few workers: GPU starvation (low utilization)
Too many workers: CPU/memory overhead
Monitor GPU utilization and adjust accordingly
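A reasonable starting point is to split the machine's CPU cores across the GPUs, capped at the 4-8 range above. The helper below is a hypothetical heuristic, not an OpenCLIP API; treat its output as a first guess to tune against observed GPU utilization:

```python
import os

def suggested_workers(num_gpus: int, max_per_gpu: int = 8) -> int:
    """Heuristic starting value for --workers: CPU cores divided across
    GPUs, capped per GPU. Tune up/down based on GPU utilization."""
    cores = os.cpu_count() or 1
    return max(1, min(max_per_gpu, cores // max(1, num_gpus)))

print(suggested_workers(num_gpus=4))
```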
Memory Optimization
If you run out of GPU memory, try these options in order:
Enable Mixed Precision:
--precision amp # or amp_bf16 for bfloat16
Reduce Batch Size:
--batch-size 128 # Reduce from 256
Enable Gradient Checkpointing:
--grad-checkpointing # Recompute activations in the backward pass to save memory
Use Gradient Accumulation:
--accum-freq 2 # Accumulate gradients over 2 steps
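Gradient accumulation keeps the effective batch size constant while shrinking the per-step memory footprint: gradients from `accum_freq` micro-batches are summed, then averaged, before a single optimizer step. A framework-agnostic sketch in plain Python (real training does this with tensors):

```python
def accumulated_updates(micro_batch_grads, accum_freq):
    """Return one averaged gradient per optimizer step, accumulating
    over accum_freq micro-batches -- the same update a batch
    accum_freq-times larger would produce."""
    buf, updates = 0.0, []
    for i, g in enumerate(micro_batch_grads, start=1):
        buf += g                              # accumulate micro-batch gradient
        if i % accum_freq == 0:               # every accum_freq micro-batches...
            updates.append(buf / accum_freq)  # ...take one averaged step
            buf = 0.0
    return updates

# Two micro-batches with gradients 1.0 and 3.0 act like one batch with gradient 2.0
print(accumulated_updates([1.0, 3.0], accum_freq=2))  # [2.0]
```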
Monitoring Training
TensorBoard
Launch TensorBoard to monitor training progress:
# Start TensorBoard
tensorboard --logdir=logs/tensorboard/ --port=7777
# Then navigate to http://localhost:7777
Training command with TensorBoard:
torchrun --nproc_per_node 4 -m open_clip_train.main \
--train-data '/data/train.tar' \
--report-to tensorboard \
--model ViT-B-32 \
# ... other arguments
Weights & Biases (wandb)
For cloud-based experiment tracking:
# Install wandb
pip install wandb
# Login (first time only)
wandb login
# Training command with wandb
torchrun --nproc_per_node 4 -m open_clip_train.main \
--train-data '/data/train.tar' \
--report-to wandb \
--wandb-project-name my-clip-project \
--model ViT-B-32 \
# ... other arguments
Both TensorBoard and wandb
You can log to both simultaneously:
--report-to tensorboard,wandb
Zero-Shot Evaluation During Training
Automatic zero-shot evaluation on ImageNet during training:
torchrun --nproc_per_node 4 -m open_clip_train.main \
--train-data '/data/train.tar' \
--imagenet-val /data/imagenet/validation/ \
--zeroshot-frequency 1 \
# ... other arguments
Parameters:
--imagenet-val: Path to ImageNet validation set
--zeroshot-frequency 1: Run zero-shot eval every epoch
--zeroshot-frequency 2: Run zero-shot eval every 2 epochs
Complete Training Example
Here’s a complete example training ViT-B/32 on CC12M with 4 GPUs:
#!/bin/bash
# Change to source directory
cd open_clip/src
# Train with 4 GPUs
torchrun --nproc_per_node 4 -m open_clip_train.main \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to tensorboard,wandb \
--wandb-project-name "clip-cc12m" \
--train-data "/data/cc12m/cc12m-train-{0000..2175}.tar" \
--train-num-samples 10968539 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 320 \
--precision amp \
--workers 6 \
--imagenet-val /data/imagenet/validation/ \
--warmup 10000 \
--lr 1e-3 \
--wd 0.1 \
--epochs 32 \
--model ViT-B-32 \
--name "vit-b32-cc12m" \
--seed 42
Advanced Configuration
Custom Learning Rate Schedule
# Cosine schedule with warmup (default)
--lr-scheduler cosine \
--warmup 10000 \
--lr 1e-3
# Constant learning rate
--lr-scheduler const \
--warmup 10000 \
--lr 1e-3
# Constant with cooldown
--lr-scheduler const-cooldown \
--warmup 10000 \
--epochs-cooldown 5 \
--lr-cooldown-end 1e-6 \
--lr 1e-3
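The default cosine-with-warmup shape can be sketched as follows. This mirrors the general form (linear warmup, then cosine decay to zero); OpenCLIP's exact implementation details may differ slightly:

```python
import math

def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay

# Peak LR is reached at the end of warmup, and decays to ~0 at total_steps
print(lr_at(10_000, 1e-3, 10_000, 100_000))
```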
Patch Dropout for ViT Models
Patch dropout can speed up Vision Transformer training by roughly 2-3x:
# Enable patch dropout during training
torchrun --nproc_per_node 4 -m open_clip_train.main \
--model ViT-B-32 \
--force-patch-dropout 0.5 \
# ... other arguments
# Disable patch dropout for final fine-tuning
--force-patch-dropout 0.0
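The idea behind patch dropout: each step, a random fraction of patch tokens is discarded before the transformer runs, shrinking the sequence length and thus attention cost. A toy sketch (the real implementation operates on batched tensors and preserves the class token; this helper is illustrative only):

```python
import random

def patch_dropout(tokens, drop_prob, seed=None):
    """Keep a random (1 - drop_prob) fraction of patch tokens, in order."""
    rng = random.Random(seed)
    n_keep = max(1, int(round(len(tokens) * (1 - drop_prob))))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]

# A ViT-B/16 at 224px has 14*14 = 196 patch tokens; drop_prob 0.5 keeps 98
kept = patch_dropout(list(range(196)), drop_prob=0.5, seed=0)
print(len(kept))  # 98
```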
Gradient Clipping
Prevent gradient explosions by clipping the global gradient norm:
--grad-clip-norm 1.0
Model-Specific Optimizations
For Vision Transformers (ViT):
--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6
For ResNet Models:
--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8
Checkpointing and Resuming
Automatic Checkpointing
Checkpoints are saved automatically:
--save-frequency 1 # Save every epoch
--logs ./logs # Checkpoint directory: ./logs/<experiment-name>/checkpoints/
Resume Training
Resume from a specific checkpoint:
torchrun --nproc_per_node 4 -m open_clip_train.main \
--resume /path/to/logs/experiment/checkpoints/epoch_10.pt \
--train-data '/data/train.tar' \
# ... other arguments (should match original training)
Resume from the latest checkpoint in the experiment's checkpoint directory:
--resume latest
Save Most Recent Checkpoint Only
To save disk space, keep only the latest checkpoint:
--save-most-recent \
--delete-previous-checkpoint
GPU Utilization
Monitor GPU usage:
# In another terminal
watch -n 1 nvidia-smi
Target: 90%+ GPU utilization
If GPU utilization is low:
Increase --workers (data loading parallelism)
Use faster storage (NVMe SSD)
Increase --batch-size if memory allows
Ensure data is preprocessed and ready
Training Speed
Typical training speeds on A100 GPUs:
Model     Batch Size (per GPU)  GPUs  Samples/sec  Time per Epoch (CC12M)
RN50      512                   4     ~4000        ~45 min
ViT-B/32  320                   4     ~2500        ~1.2 hours
ViT-L/14  128                   8     ~1200        ~2.5 hours
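These epoch times follow directly from dataset size divided by throughput (ignoring evaluation and checkpointing overhead). A quick back-of-the-envelope check, using CC12M's 10,968,539 samples:

```python
def epoch_minutes(num_samples: int, samples_per_sec: float) -> float:
    """Rough wall-clock estimate for one epoch, ignoring eval/checkpoint overhead."""
    return num_samples / samples_per_sec / 60.0

# CC12M at ~4000 samples/sec overall -> roughly 45 minutes per epoch
print(round(epoch_minutes(10_968_539, 4000)))  # 46
```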
Troubleshooting
Out of Memory Errors
RuntimeError: CUDA out of memory
Solutions:
Reduce --batch-size
Enable --precision amp
Use --grad-checkpointing
Increase --accum-freq and reduce --batch-size
Data Loading Bottleneck
Solutions:
Increase --workers
Use faster storage (SSD vs HDD)
Preprocess data to WebDataset format
Check network speed if data is remote
Port Already in Use
Solution:
# Kill existing processes
pkill -f open_clip_train
# Or use a different master port (no spaces around "=")
export MASTER_PORT=29500
# Equivalently, pass --master_port 29500 to torchrun
ImageNet Validation Issues
If zero-shot evaluation fails, ensure:
--imagenet-val points to validation set (not training set)
Directory structure is correct:
imagenet/val/
├── n01440764/
├── n01443537/
└── ...
Use the ImageNet validation prep script if needed
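A quick structural check can catch the common mistake of pointing --imagenet-val at a flat directory of images. The helper below is a hypothetical sketch, not part of OpenCLIP: it just verifies the root contains WordNet-id class subfolders rather than loose image files:

```python
from pathlib import Path

def looks_like_imagenet_val(root) -> bool:
    """Hypothetical sanity check: True if root holds class subdirectories
    (WordNet ids like n01440764), False if it is a flat pile of images."""
    root = Path(root)
    class_dirs = [d for d in root.iterdir()
                  if d.is_dir() and d.name.startswith("n") and d.name[1:].isdigit()]
    flat_images = [f for f in root.iterdir()
                   if f.is_file() and f.suffix.lower() in {".jpeg", ".jpg", ".png"}]
    return bool(class_dirs) and not flat_images
```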
Example Training Scripts
Small-Scale Experiment (RN50 on CC3M)
#!/bin/bash
torchrun --nproc_per_node 2 -m open_clip_train.main \
--train-data "/data/cc3m/train.csv" \
--dataset-type csv \
--csv-img-key filepath \
--csv-caption-key title \
--batch-size 256 \
--precision amp \
--workers 4 \
--warmup 2000 \
--lr 1e-3 \
--wd 0.1 \
--epochs 30 \
--model RN50 \
--report-to tensorboard
Medium-Scale (ViT-B/32 on CC12M)
#!/bin/bash
torchrun --nproc_per_node 4 -m open_clip_train.main \
--train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
--train-num-samples 10968539 \
--dataset-type webdataset \
--batch-size 320 \
--precision amp \
--workers 6 \
--imagenet-val /data/imagenet/val/ \
--warmup 10000 \
--lr 1e-3 \
--wd 0.1 \
--epochs 32 \
--model ViT-B-32 \
--save-frequency 1 \
--report-to wandb
Large Model (ViT-L/14)
#!/bin/bash
torchrun --nproc_per_node 8 -m open_clip_train.main \
--train-data "/data/laion400m/{00000..41455}.tar" \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 128 \
--precision amp \
--workers 8 \
--grad-checkpointing \
--imagenet-val /data/imagenet/val/ \
--warmup 10000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--model ViT-L-14 \
--save-frequency 1 \
--zeroshot-frequency 2 \
--report-to wandb
Next Steps
Multi-Node Training Scale to multiple machines with torchrun or SLURM
Configuration Explore all available training parameters
Distributed Training Advanced distributed training techniques
Data Preparation Prepare datasets in CSV or WebDataset format