
Overview

Single-node training is ideal for experiments and medium-scale runs on one machine with multiple GPUs. OpenCLIP uses torchrun (PyTorch's distributed launcher) for efficient multi-GPU training on a single node.

Prerequisites

  • Single machine with 1 or more GPUs
  • CUDA-capable GPUs (recommended: V100, A100, or newer)
  • OpenCLIP installed with training dependencies
  • Training data prepared in CSV or WebDataset format

Basic Single-Node Training

Single GPU Training

For single GPU training, you can use the training script directly without torchrun:
python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data "/path/to/train_data.csv" \
    --val-data "/path/to/validation_data.csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val /path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size 128 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --model RN50
The --imagenet-val argument should point to the ImageNet validation set (not the training set) for zero-shot evaluation. The val folder should contain one subfolder per class.

Multi-GPU Training with torchrun

For training on multiple GPUs on a single node, use torchrun with the --nproc_per_node flag:
cd open_clip/src
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --report-to tensorboard
Key Parameters:
  • --nproc_per_node 4: Number of GPUs to use (4 GPUs in this example)
  • --batch-size 320: Per-GPU batch size (total batch size = 320 × 4 = 1280)
  • --workers 4: Number of data loading workers per GPU
  • --precision amp: Automatic Mixed Precision for faster training and lower memory usage

WebDataset Training Example

WebDataset format is recommended for datasets larger than 10M samples:
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/laion400m/{00000..41455}.tar' \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp \
    --workers 6 \
    --model ViT-B-32 \
    --warmup 2000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --save-frequency 1 \
    --report-to wandb
WebDataset-specific flags:
  • --dataset-type webdataset: Specify WebDataset format
  • --dataset-resampled: Enable sampling with replacement (recommended for large datasets)
  • --train-num-samples: Total number of samples in dataset
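The `{0000..2175}` brace pattern in --train-data expands into one URL per shard. WebDataset performs this expansion internally; the stand-in below is only a minimal illustration of what the pattern means:

```python
import re

def expand_shards(pattern):
    """Expand one zero-padded '{NNNN..MMMM}' range in a shard pattern."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve zero padding
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(int(lo), int(hi) + 1)]

print(expand_shards("cc12m-train-{0000..0002}.tar"))
# → ['cc12m-train-0000.tar', 'cc12m-train-0001.tar', 'cc12m-train-0002.tar']
```

Note that `{0000..2175}` is inclusive on both ends, so the CC12M example above reads 2,176 shards.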

Batch Size and Worker Configuration

Calculating Effective Batch Size

The effective batch size is:
Effective Batch Size = batch_size × num_gpus × accum_freq
Example:
  • --batch-size 256 (per GPU)
  • --nproc_per_node 4 (4 GPUs)
  • --accum-freq 1 (no gradient accumulation)
  • Effective batch size: 256 × 4 × 1 = 1024
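The formula above is plain arithmetic; the values below mirror the example and are not tied to any OpenCLIP API:

```python
# Effective batch size = per-GPU batch size × number of GPUs × accumulation steps
batch_size = 256   # --batch-size (per GPU)
num_gpus = 4       # --nproc_per_node
accum_freq = 1     # --accum-freq (no gradient accumulation)

effective_batch_size = batch_size * num_gpus * accum_freq
print(effective_batch_size)  # → 1024
```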

Optimizing Worker Count

The --workers parameter controls the number of data loading processes per GPU:
--workers 4  # 4 workers per GPU
Guidelines:
  • Start with 4-8 workers per GPU
  • Too few workers: GPU starvation (low utilization)
  • Too many workers: CPU/memory overhead
  • Monitor GPU utilization and adjust accordingly
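One simple starting heuristic (our suggestion, not an OpenCLIP built-in) is to split the available CPU cores across GPUs and cap the result:

```python
def suggest_workers(cpu_cores, num_gpus, max_workers=8):
    """Naive heuristic: divide CPU cores across GPUs, capped at max_workers."""
    return max(1, min(max_workers, cpu_cores // num_gpus))

print(suggest_workers(32, 4))  # 32-core box, 4 GPUs → 8 workers per GPU
print(suggest_workers(16, 8))  # 16-core box, 8 GPUs → 2 workers per GPU
```

Treat the result as a starting point and tune against observed GPU utilization.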

Memory Optimization

If you run out of GPU memory, try these options in order:
  1. Enable Mixed Precision:
    --precision amp  # or amp_bf16 for bfloat16
    
  2. Reduce Batch Size:
    --batch-size 128  # Reduce from 256
    
  3. Enable Gradient Checkpointing:
    --grad-checkpointing
    
  4. Use Gradient Accumulation:
    --accum-freq 2  # Accumulate gradients over 2 steps
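Gradient accumulation trades memory for extra forward/backward passes: averaging the gradients of two half-size micro-batches equals one full-batch gradient (for a mean-reduced loss over equal-size micro-batches). A framework-free sketch:

```python
# Gradient of mean squared error on a tiny dataset, computed two ways.
data = [1.0, 2.0, 3.0, 4.0]
w = 0.5

def grad(batch, w):
    # d/dw of mean((w*x - x)^2) = mean(2*x*(w*x - x))
    return sum(2 * x * (w * x - x) for x in batch) / len(batch)

full = grad(data, w)                                 # one big batch of 4
accum = (grad(data[:2], w) + grad(data[2:], w)) / 2  # two micro-batches, averaged
assert abs(full - accum) < 1e-12                     # identical gradient
```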
    

Monitoring Training

TensorBoard

Launch TensorBoard to monitor training progress:
# Start TensorBoard
tensorboard --logdir=logs/tensorboard/ --port=7777

# Then navigate to http://localhost:7777
Training command with TensorBoard:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --report-to tensorboard \
    --model ViT-B-32 \
    # ... other arguments

Weights & Biases (wandb)

For cloud-based experiment tracking:
# Install wandb
pip install wandb

# Login (first time only)
wandb login

# Training command with wandb
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --report-to wandb \
    --wandb-project-name my-clip-project \
    --model ViT-B-32 \
    # ... other arguments

Both TensorBoard and wandb

You can log to both simultaneously:
--report-to tensorboard,wandb

Zero-Shot Evaluation During Training

Automatic zero-shot evaluation on ImageNet during training:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --imagenet-val /data/imagenet/validation/ \
    --zeroshot-frequency 1 \
    # ... other arguments
Parameters:
  • --imagenet-val: Path to ImageNet validation set
  • --zeroshot-frequency 1: Run zero-shot eval every epoch
  • --zeroshot-frequency 2: Run zero-shot eval every 2 epochs

Complete Training Example

Here’s a complete example training ViT-B/32 on CC12M with 4 GPUs:
#!/bin/bash

# Change to source directory
cd open_clip/src

# Train with 4 GPUs
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard,wandb \
    --wandb-project-name "clip-cc12m" \
    --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --name "vit-b32-cc12m" \
    --seed 42

Advanced Configuration

Custom Learning Rate Schedule

# Cosine schedule with warmup (default)
--lr-scheduler cosine \
--warmup 10000 \
--lr 1e-3

# Constant learning rate
--lr-scheduler const \
--warmup 10000 \
--lr 1e-3

# Constant with cooldown
--lr-scheduler const-cooldown \
--warmup 10000 \
--epochs-cooldown 5 \
--lr-cooldown-end 1e-6 \
--lr 1e-3
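The shape of the default schedule (linear warmup, then cosine decay toward zero) can be sketched as below; this mirrors the behavior described above, not the exact OpenCLIP code:

```python
import math

def cosine_lr(step, base_lr, warmup, total_steps):
    """Linear warmup to base_lr, then cosine decay toward 0."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(cosine_lr(9999, 1e-3, 10000, 100000))    # end of warmup: 0.001
print(cosine_lr(100000, 1e-3, 10000, 100000))  # end of training: ~0.0
```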

Patch Dropout for ViT Models

Patch dropout can speed up Vision Transformer training by roughly 2-3x:
# Enable patch dropout during training
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    # ... other arguments

# Disable patch dropout for final fine-tuning
--force-patch-dropout 0.0
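Patch dropout keeps a random subset of patch tokens each step; with a ratio of 0.5, a ViT-B/32 input at 224px (224/32 = 7, so 7 × 7 = 49 patches) trains on roughly half its tokens. A toy sketch of the selection (not the OpenCLIP implementation):

```python
import random

def keep_patches(num_patches, dropout, seed=0):
    """Return the sorted patch indices kept under patch dropout."""
    rng = random.Random(seed)
    num_keep = max(1, int(num_patches * (1 - dropout)))
    return sorted(rng.sample(range(num_patches), num_keep))

kept = keep_patches(49, 0.5)  # ViT-B/32 at 224px: 7*7 = 49 patches
print(len(kept))              # → 24
```

Fewer tokens per image means less attention compute per step, which is where the speedup comes from.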

Gradient Clipping

Prevent gradient explosion:
--grad-clip-norm 1.0
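Norm-based clipping rescales the whole gradient vector when its global L2 norm exceeds the threshold, which is the same math PyTorch's `torch.nn.utils.clip_grad_norm_` applies. In plain Python:

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads                      # within budget: unchanged
    scale = max_norm / total
    return [g * scale for g in grads]     # direction preserved, norm capped

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5.0 → rescaled to [0.6, 0.8]
```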

Model-Specific Optimizations

For Vision Transformers (ViT):
--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6
For ResNet Models:
--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8

Checkpointing and Resuming

Automatic Checkpointing

Checkpoints are saved automatically:
--save-frequency 1  # Save every epoch
--logs ./logs       # Checkpoint directory: ./logs/<experiment-name>/checkpoints/

Resume Training

Resume from a specific checkpoint:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --resume /path/to/logs/experiment/checkpoints/epoch_10.pt \
    --train-data '/data/train.tar' \
    # ... other arguments (should match original training)
Resume from latest checkpoint:
--resume latest

Save Most Recent Checkpoint Only

To save disk space, keep only the latest checkpoint:
--save-most-recent \
--delete-previous-checkpoint

Performance Optimization

GPU Utilization

Monitor GPU usage:
# In another terminal
watch -n 1 nvidia-smi
Target: 90%+ GPU utilization.
If GPU utilization is low:
  • Increase --workers (data loading parallelism)
  • Use faster storage (NVMe SSD)
  • Increase --batch-size if memory allows
  • Ensure data is preprocessed and ready

Training Speed

Typical training speeds on A100 GPUs:
| Model    | Batch Size (per GPU) | GPUs | Samples/sec | Time per Epoch (CC12M) |
|----------|----------------------|------|-------------|------------------------|
| RN50     | 512                  | 4    | ~4000       | ~45 min                |
| ViT-B/32 | 320                  | 4    | ~2500       | ~1.2 hours             |
| ViT-L/14 | 128                  | 8    | ~1200       | ~2.5 hours             |
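The per-epoch times follow directly from dataset size and throughput. For example, CC12M (~10.97M samples) at ~2,500 samples/sec:

```python
samples = 10_968_539  # CC12M training samples
throughput = 2500     # samples/sec (ViT-B/32 on 4 GPUs)

hours = samples / throughput / 3600
print(round(hours, 1))  # → 1.2
```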

Troubleshooting

Out of Memory Errors

RuntimeError: CUDA out of memory
Solutions:
  1. Reduce --batch-size
  2. Enable --precision amp
  3. Use --grad-checkpointing
  4. Increase --accum-freq and reduce --batch-size

Data Loading Bottleneck

GPU utilization < 50%
Solutions:
  1. Increase --workers
  2. Use faster storage (SSD vs HDD)
  3. Preprocess data to WebDataset format
  4. Check network speed if data is remote

Port Already in Use

Address already in use
Solution:
# Kill existing processes
pkill -f open_clip_train

# Or pass a different rendezvous port to torchrun (29500 is the default)
torchrun --master_port 29501 --nproc_per_node 4 -m open_clip_train.main ...

ImageNet Validation Issues

If zero-shot evaluation fails, ensure:
  1. --imagenet-val points to validation set (not training set)
  2. Directory structure is correct:
    imagenet/val/
    ├── n01440764/
    ├── n01443537/
    └── ...
    
  3. Use the ImageNet validation prep script if needed

Example Training Scripts

Small-Scale Experiment (RN50 on CC3M)

#!/bin/bash
torchrun --nproc_per_node 2 -m open_clip_train.main \
    --train-data "/data/cc3m/train.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --model RN50 \
    --report-to tensorboard

Medium-Scale (ViT-B/32 on CC12M)

#!/bin/bash
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/val/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --save-frequency 1 \
    --report-to wandb

Large Model (ViT-L/14)

#!/bin/bash
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --workers 8 \
    --grad-checkpointing \
    --imagenet-val /data/imagenet/val/ \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-L-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --report-to wandb

Next Steps

Multi-Node Training

Scale to multiple machines with torchrun or SLURM

Configuration

Explore all available training parameters

Distributed Training

Advanced distributed training techniques

Data Preparation

Prepare datasets in CSV or WebDataset format
