
Overview

Single-node training is ideal for experiments and medium-scale runs on one machine with multiple GPUs. OpenCLIP uses torchrun (PyTorch's distributed launcher) for efficient multi-GPU training on a single node.

Prerequisites

  • Single machine with 1 or more GPUs
  • CUDA-capable GPUs (recommended: V100, A100, or newer)
  • OpenCLIP installed with training dependencies
  • Training data prepared in CSV or WebDataset format

Basic Single-Node Training

Single GPU Training

For single GPU training, you can use the training script directly without torchrun:
python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data "/path/to/train_data.csv" \
    --val-data "/path/to/validation_data.csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val /path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size 128 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --model RN50
The --imagenet-val argument should point to the ImageNet validation set (not the training set) for zero-shot evaluation. The val folder should contain one subfolder per class.

Multi-GPU Training with torchrun

For training on multiple GPUs on a single node, use torchrun with the --nproc_per_node flag:
cd open_clip/src
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --report-to tensorboard
Key Parameters:
  • --nproc_per_node 4: Number of GPUs to use (4 GPUs in this example)
  • --batch-size 320: Per-GPU batch size (total batch size = 320 × 4 = 1280)
  • --workers 4: Number of data loading workers per GPU
  • --precision amp: Automatic Mixed Precision for faster training and lower memory usage

WebDataset Training Example

WebDataset format is recommended for datasets larger than 10M samples:
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/laion400m/{00000..41455}.tar' \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp \
    --workers 6 \
    --model ViT-B-32 \
    --warmup 2000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --save-frequency 1 \
    --report-to wandb
WebDataset-specific flags:
  • --dataset-type webdataset: Specify WebDataset format
  • --dataset-resampled: Enable sampling with replacement (recommended for large datasets)
  • --train-num-samples: Total number of samples in dataset
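The `{0000..2175}` brace pattern in --train-data expands into one URL per shard. WebDataset performs this expansion internally; the stand-in below is only a minimal illustration of what the pattern means:

```python
import re

def expand_shards(pattern):
    """Expand one zero-padded '{NNNN..MMMM}' range in a shard pattern."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve zero padding
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(int(lo), int(hi) + 1)]

print(expand_shards("cc12m-train-{0000..0002}.tar"))
# → ['cc12m-train-0000.tar', 'cc12m-train-0001.tar', 'cc12m-train-0002.tar']
```

Note that `{0000..2175}` is inclusive on both ends, so the CC12M example above reads 2,176 shards.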

Batch Size and Worker Configuration

Calculating Effective Batch Size

The effective batch size is:
Effective Batch Size = batch_size × num_gpus × accum_freq
Example:
  • --batch-size 256 (per GPU)
  • --nproc_per_node 4 (4 GPUs)
  • --accum-freq 1 (no gradient accumulation)
  • Effective batch size: 256 × 4 × 1 = 1024
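The formula above is plain arithmetic; the values below mirror the example and are not tied to any OpenCLIP API:

```python
# Effective batch size = per-GPU batch size × number of GPUs × accumulation steps
batch_size = 256   # --batch-size (per GPU)
num_gpus = 4       # --nproc_per_node
accum_freq = 1     # --accum-freq (no gradient accumulation)

effective_batch_size = batch_size * num_gpus * accum_freq
print(effective_batch_size)  # → 1024
```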

Optimizing Worker Count

The --workers parameter controls the number of data loading processes per GPU:
--workers 4  # 4 workers per GPU
Guidelines:
  • Start with 4-8 workers per GPU
  • Too few workers: GPU starvation (low utilization)
  • Too many workers: CPU/memory overhead
  • Monitor GPU utilization and adjust accordingly
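One simple starting heuristic (our suggestion, not an OpenCLIP built-in) is to split the available CPU cores across GPUs and cap the result:

```python
def suggest_workers(cpu_cores, num_gpus, max_workers=8):
    """Naive heuristic: divide CPU cores across GPUs, capped at max_workers."""
    return max(1, min(max_workers, cpu_cores // num_gpus))

print(suggest_workers(32, 4))  # 32-core box, 4 GPUs → 8 workers per GPU
print(suggest_workers(16, 8))  # 16-core box, 8 GPUs → 2 workers per GPU
```

Treat the result as a starting point and tune against observed GPU utilization.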

Memory Optimization

If you run out of GPU memory, try these options in order:
  1. Enable Mixed Precision:
    --precision amp  # or amp_bf16 for bfloat16
    
  2. Reduce Batch Size:
    --batch-size 128  # Reduce from 256
    
  3. Enable Gradient Checkpointing:
    --grad-checkpointing
    
  4. Use Gradient Accumulation:
    --accum-freq 2  # Accumulate gradients over 2 steps
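Gradient accumulation trades memory for extra forward/backward passes: averaging the gradients of two half-size micro-batches equals one full-batch gradient (for a mean-reduced loss over equal-size micro-batches). A framework-free sketch:

```python
# Gradient of mean squared error on a tiny dataset, computed two ways.
data = [1.0, 2.0, 3.0, 4.0]
w = 0.5

def grad(batch, w):
    # d/dw of mean((w*x - x)^2) = mean(2*x*(w*x - x))
    return sum(2 * x * (w * x - x) for x in batch) / len(batch)

full = grad(data, w)                                 # one big batch of 4
accum = (grad(data[:2], w) + grad(data[2:], w)) / 2  # two micro-batches, averaged
assert abs(full - accum) < 1e-12                     # identical gradient
```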
    

Monitoring Training

TensorBoard

Launch TensorBoard to monitor training progress:
# Start TensorBoard
tensorboard --logdir=logs/tensorboard/ --port=7777

# Then navigate to http://localhost:7777
Training command with TensorBoard:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --report-to tensorboard \
    --model ViT-B-32 \
    # ... other arguments

Weights & Biases (wandb)

For cloud-based experiment tracking:
# Install wandb
pip install wandb

# Login (first time only)
wandb login

# Training command with wandb
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --report-to wandb \
    --wandb-project-name my-clip-project \
    --model ViT-B-32 \
    # ... other arguments

Both TensorBoard and wandb

You can log to both simultaneously:
--report-to tensorboard,wandb

Zero-Shot Evaluation During Training

Automatic zero-shot evaluation on ImageNet during training:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --imagenet-val /data/imagenet/validation/ \
    --zeroshot-frequency 1 \
    # ... other arguments
Parameters:
  • --imagenet-val: Path to ImageNet validation set
  • --zeroshot-frequency 1: Run zero-shot eval every epoch
  • --zeroshot-frequency 2: Run zero-shot eval every 2 epochs

Complete Training Example

Here’s a complete example training ViT-B/32 on CC12M with 4 GPUs:
#!/bin/bash

# Change to source directory
cd open_clip/src

# Train with 4 GPUs
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard,wandb \
    --wandb-project-name "clip-cc12m" \
    --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --name "vit-b32-cc12m" \
    --seed 42

Advanced Configuration

Custom Learning Rate Schedule

# Cosine schedule with warmup (default)
--lr-scheduler cosine \
--warmup 10000 \
--lr 1e-3

# Constant learning rate
--lr-scheduler const \
--warmup 10000 \
--lr 1e-3

# Constant with cooldown
--lr-scheduler const-cooldown \
--warmup 10000 \
--epochs-cooldown 5 \
--lr-cooldown-end 1e-6 \
--lr 1e-3
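The shape of the default schedule (linear warmup, then cosine decay toward zero) can be sketched as below; this mirrors the behavior described above, not the exact OpenCLIP code:

```python
import math

def cosine_lr(step, base_lr, warmup, total_steps):
    """Linear warmup to base_lr, then cosine decay toward 0."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(cosine_lr(9999, 1e-3, 10000, 100000))    # end of warmup: 0.001
print(cosine_lr(100000, 1e-3, 10000, 100000))  # end of training: ~0.0
```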

Patch Dropout for ViT Models

Patch dropout can speed up Vision Transformer training by roughly 2-3x:
# Enable patch dropout during training
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    # ... other arguments

# Disable patch dropout for final fine-tuning
--force-patch-dropout 0.0
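Patch dropout keeps a random subset of patch tokens each step; with a ratio of 0.5, a ViT-B/32 input at 224px (224/32 = 7, so 7 × 7 = 49 patches) trains on roughly half its tokens. A toy sketch of the selection (not the OpenCLIP implementation):

```python
import random

def keep_patches(num_patches, dropout, seed=0):
    """Return the sorted patch indices kept under patch dropout."""
    rng = random.Random(seed)
    num_keep = max(1, int(num_patches * (1 - dropout)))
    return sorted(rng.sample(range(num_patches), num_keep))

kept = keep_patches(49, 0.5)  # ViT-B/32 at 224px: 7*7 = 49 patches
print(len(kept))              # → 24
```

Fewer tokens per image means less attention compute per step, which is where the speedup comes from.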

Gradient Clipping

Prevent gradient explosion:
--grad-clip-norm 1.0
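Norm-based clipping rescales the whole gradient vector when its global L2 norm exceeds the threshold, which is the same math PyTorch's `torch.nn.utils.clip_grad_norm_` applies. In plain Python:

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads                      # within budget: unchanged
    scale = max_norm / total
    return [g * scale for g in grads]     # direction preserved, norm capped

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5.0 → rescaled to [0.6, 0.8]
```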

Model-Specific Optimizations

For Vision Transformers (ViT):
--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6
For ResNet Models:
--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8

Checkpointing and Resuming

Automatic Checkpointing

Checkpoints are saved automatically:
--save-frequency 1  # Save every epoch
--logs ./logs       # Checkpoint directory: ./logs/<experiment-name>/checkpoints/

Resume Training

Resume from a specific checkpoint:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --resume /path/to/logs/experiment/checkpoints/epoch_10.pt \
    --train-data '/data/train.tar' \
    # ... other arguments (should match original training)
Resume from latest checkpoint:
--resume latest

Save Most Recent Checkpoint Only

To save disk space, keep only the latest checkpoint:
--save-most-recent \
--delete-previous-checkpoint

Performance Optimization

GPU Utilization

Monitor GPU usage:
# In another terminal
watch -n 1 nvidia-smi
Target: 90%+ GPU utilization.
If GPU utilization is low:
  • Increase --workers (data loading parallelism)
  • Use faster storage (NVMe SSD)
  • Increase --batch-size if memory allows
  • Ensure data is preprocessed and ready

Training Speed

Typical training speeds on A100 GPUs:
| Model    | Batch Size (per GPU) | GPUs | Samples/sec | Time per Epoch (CC12M) |
|----------|----------------------|------|-------------|------------------------|
| RN50     | 512                  | 4    | ~4000       | ~45 min                |
| ViT-B/32 | 320                  | 4    | ~2500       | ~1.2 hours             |
| ViT-L/14 | 128                  | 8    | ~1200       | ~2.5 hours             |
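The per-epoch times follow directly from dataset size and throughput. For example, CC12M (~10.97M samples) at ~2,500 samples/sec:

```python
samples = 10_968_539  # CC12M training samples
throughput = 2500     # samples/sec (ViT-B/32 on 4 GPUs)

hours = samples / throughput / 3600
print(round(hours, 1))  # → 1.2
```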

Troubleshooting

Out of Memory Errors

RuntimeError: CUDA out of memory
Solutions:
  1. Reduce --batch-size
  2. Enable --precision amp
  3. Use --grad-checkpointing
  4. Increase --accum-freq and reduce --batch-size

Data Loading Bottleneck

GPU utilization < 50%
Solutions:
  1. Increase --workers
  2. Use faster storage (SSD vs HDD)
  3. Preprocess data to WebDataset format
  4. Check network speed if data is remote

Port Already in Use

Address already in use
Solution:
# Kill existing processes
pkill -f open_clip_train

# Or pass a different rendezvous port to torchrun (29500 is the default)
torchrun --master_port 29501 --nproc_per_node 4 -m open_clip_train.main ...

ImageNet Validation Issues

If zero-shot evaluation fails, ensure:
  1. --imagenet-val points to validation set (not training set)
  2. Directory structure is correct:
    imagenet/val/
    ├── n01440764/
    ├── n01443537/
    └── ...
    
  3. Use the ImageNet validation prep script if needed

Example Training Scripts

Small-Scale Experiment (RN50 on CC3M)

#!/bin/bash
torchrun --nproc_per_node 2 -m open_clip_train.main \
    --train-data "/data/cc3m/train.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --model RN50 \
    --report-to tensorboard

Medium-Scale (ViT-B/32 on CC12M)

#!/bin/bash
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/val/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --save-frequency 1 \
    --report-to wandb

Large Model (ViT-L/14)

#!/bin/bash
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --grad-checkpointing \
    --imagenet-val /data/imagenet/val/ \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-L-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --report-to wandb

Next Steps

Multi-Node Training

Scale to multiple machines with torchrun or SLURM

Configuration

Explore all available training parameters

Distributed Training

Advanced distributed training techniques

Data Preparation

Prepare datasets in CSV or WebDataset format
