
Overview

OpenCLIP provides advanced distributed training techniques that enable efficient training on hundreds or thousands of GPUs. This guide covers memory-efficient distributed loss computation, gradient accumulation, mixed precision training, and performance optimizations.

Memory-Efficient Distributed Loss

The standard CLIP contrastive loss requires computing a logit matrix of size (batch_size × num_gpus) × (batch_size × num_gpus), leading to O(n²) memory complexity. For large-scale training, this becomes a bottleneck.

The Problem: O(n²) Memory Complexity

Without optimization:
  • Global batch size: 256 samples/GPU × 128 GPUs = 32,768 samples
  • Logit matrix size: 32,768 × 32,768 = 1,073,741,824 elements
  • Memory (fp32): ~4 GB just for the logit matrix
This scales poorly and limits the number of GPUs you can use.
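The arithmetic above is easy to verify with a back-of-the-envelope sketch (plain Python, not OpenCLIP code):

```python
# Back-of-the-envelope memory estimate for the full contrastive logit matrix.
per_gpu_batch = 256
num_gpus = 128
bytes_per_fp32 = 4

global_batch = per_gpu_batch * num_gpus   # 32,768 samples
elements = global_batch ** 2              # full global logit matrix
gib = elements * bytes_per_fp32 / 2**30

print(f"{elements:,} elements, {gib:.1f} GiB in fp32")
# 1,073,741,824 elements, 4.0 GiB in fp32
```

Doubling the GPU count quadruples this matrix, which is why the naïve approach stops scaling.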

The Solution: Local Loss with Gradient Gathering

OpenCLIP implements an efficient distributed loss computation that achieves O(n) memory complexity while maintaining identical numerical results.
python -m open_clip_train.main \
    --local-loss \
    --gather-with-grad \
    # ... other arguments
With optimization:
  • Memory per GPU: O(local_batch_size × global_batch_size) instead of O(global_batch_size²), i.e. linear rather than quadratic in the number of GPUs
  • Numerical results: identical to the naïve all-gather approach

How It Works

  1. --local-loss: Compute loss locally with gathered features, avoiding full global logit matrix
  2. --gather-with-grad: Enable gradient flow through the feature gathering operation
Together, these flags enable linear memory scaling:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --local-loss \
    --gather-with-grad \
    --model ViT-B-32
Always use --local-loss and --gather-with-grad together for multi-node training (8+ GPUs). These flags are essential for scaling beyond small clusters.
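To see where the linear scaling comes from, here is a toy pure-Python sketch (illustrative only, not OpenCLIP's implementation): features are all-gathered, but each rank computes only its own rows of the similarity matrix against the gathered global features.

```python
# Toy sketch of local-loss logit computation. Each "rank" holds local_batch
# feature vectors; after the all-gather, every rank computes only its
# local_batch x global_batch slice of logits instead of the full matrix.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_logits(local_feats, gathered_feats):
    # local_batch x global_batch slice -- grows linearly with rank count.
    return [[dot(q, k) for k in gathered_feats] for q in local_feats]

ranks = 4
local_batch = 2
# Fake per-rank features (2-d vectors) and their all-gather.
per_rank = [[[r + 1.0, i + 1.0] for i in range(local_batch)] for r in range(ranks)]
gathered = [f for rank_feats in per_rank for f in rank_feats]

slices = [local_logits(per_rank[r], gathered) for r in range(ranks)]

logits_per_rank = len(slices[0]) * len(slices[0][0])  # 2 x 8 = 16 per rank
full_matrix = (ranks * local_batch) ** 2              # 8 x 8 = 64 total
```

Stacking the per-rank slices reproduces the full logit matrix row for row, which is why the loss values match the naïve approach exactly.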

Gradient Accumulation

Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights.

Basic Usage

python -m open_clip_train.main \
    --batch-size 128 \
    --accum-freq 4 \
    # ... other arguments
Effective batch size:
effective_batch = batch_size × num_gpus × accum_freq
                = 128 × 8 × 4
                = 4,096

When to Use Gradient Accumulation

Use --accum-freq when:
  1. ✅ You need larger effective batch sizes than GPU memory allows
  2. ✅ You want to maintain a specific batch size across different hardware
  3. ✅ You’re experimenting with very large batch sizes (>100k)
However, try these first:
  1. Enable --precision amp (mixed precision)
  2. Use --grad-checkpointing (memory-compute tradeoff)
  3. Use --local-loss --gather-with-grad (for distributed training)

Performance Implications

Speed: Samples/sec remains approximately constant
  • Without accumulation: Process 1024 samples in 1 step
  • With --accum-freq 4: Process 256 samples × 4 in 4 steps
  • Net throughput: Similar
Memory: Additional GPU memory required for:
  • Cached features from all accumulated batches
  • Multiple loss computations (one per accumulated batch)

Real-World Example

# Training ViT-L-14 on 8 A100 GPUs (40GB)
# Without gradient accumulation: OOM at batch_size 256
# With gradient accumulation: Works!

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/laion400m/{00000..41455}.tar' \
    --batch-size 128 \
    --accum-freq 2 \
    --precision amp \
    --grad-checkpointing \
    --model ViT-L-14 \
    --local-loss \
    --gather-with-grad
# Effective batch: 128 × 8 × 2 = 2,048

Implementation Details

Gradient accumulation in OpenCLIP:
  1. First N-1 steps: Forward pass with torch.no_grad(), cache features
  2. Nth step: Re-run forward passes with gradients enabled
  3. Backward: Compute gradients using cached features as negatives
  4. Step: Update optimizer
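OpenCLIP must cache features and re-run forward passes because the contrastive loss does not decompose per sample. For a loss that does average over samples, the core identity behind accumulation is exact, as this stdlib-only sketch shows:

```python
# Numerical sketch: for a loss that averages over samples, averaging the
# per-micro-batch gradients equals the full-batch gradient.
# Toy model: scalar w, loss = mean((w*x - y)^2), d(loss)/dw = mean(2x(w*x - y)).

def grad(w, xs, ys):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]

full = grad(w, xs, ys)            # one batch of 8 samples

accum_freq = 4
micro = 2
acc = 0.0
for i in range(accum_freq):       # 4 micro-batches of 2 samples
    acc += grad(w, xs[i*micro:(i+1)*micro], ys[i*micro:(i+1)*micro])
acc /= accum_freq

assert abs(acc - full) < 1e-12    # identical gradient, smaller peak memory
```

The contrastive loss couples every sample with every other, so OpenCLIP additionally needs the cached features from all micro-batches as negatives, per the steps above.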

Mixed Precision Training

Mixed precision training uses lower precision (float16 or bfloat16) for most computations while maintaining float32 for critical operations.

Automatic Mixed Precision (AMP)

# FP16 with automatic loss scaling (recommended)
python -m open_clip_train.main \
    --precision amp \
    # ... other arguments

# BFloat16 (if supported by hardware)
python -m open_clip_train.main \
    --precision amp_bf16 \
    # ... other arguments
Benefits:
  • 🚀 Speed: 2-3× faster training on modern GPUs (A100, H100)
  • 💾 Memory: ~50% reduction in activation memory
  • 📊 Accuracy: Negligible impact with automatic loss scaling
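Loss scaling matters because fp16 has a narrow dynamic range. A quick stdlib demonstration (Python's `struct` supports IEEE 754 binary16 via the `'e'` format):

```python
# Why AMP uses loss scaling: binary16 (fp16) has limited range, so very
# small gradient values underflow to zero unless the loss is scaled up.
import struct

def to_fp16(x):
    # Round-trip a Python float through binary16 ('e' struct format).
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))      # largest finite fp16 value, exactly representable
print(to_fp16(1e-8))         # underflows to 0.0 -- the gradient is lost
print(to_fp16(1e-8 * 1024))  # scaling by 2**10 keeps it representable
```

AMP multiplies the loss by a scale factor before backward, then divides gradients by it before the optimizer step, keeping small gradients out of the underflow region.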

Precision Options

Precision   Description                        Use Case
amp         Automatic Mixed Precision (FP16)   Recommended for most training
amp_bf16    AMP with BFloat16                  A100/H100 GPUs, more stable than FP16
fp32        Full 32-bit precision              Debugging, baseline comparison
fp16        Pure FP16 (not recommended)        Legacy, requires manual tuning
bf16        Pure BFloat16                      Experimental

Hardware Support

NVIDIA GPUs:
  • Volta (V100): FP16 via --precision amp
  • Ampere (A100): FP16 or BF16 via --precision amp or --precision amp_bf16
  • Hopper (H100): BF16 recommended via --precision amp_bf16
AMD GPUs:
  • MI250X: FP16 via --precision amp

Example: Mixed Precision Training

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 512 \
    --precision amp \
    --model ViT-B-32
# Batch size doubled thanks to reduced memory usage
Avoid using --precision fp16 (pure FP16) without automatic loss scaling. Use --precision amp instead, which handles loss scaling automatically.

Patch Dropout for Vision Transformers

Patch dropout randomly drops image patches during training, leading to 2-3× speedup for Vision Transformer models without accuracy loss.

Research Background

Li et al. 2022 showed that dropping 50-75% of visual tokens during training:
  • ✅ Speeds up training by 2-3×
  • ✅ Maintains final accuracy
  • ✅ Acts as a form of data augmentation
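The mechanism is simple: each training step, only a random subset of patch tokens is fed to the transformer. A toy sketch (not the OpenCLIP implementation):

```python
# Toy sketch of patch dropout: keep a random subset of patch token indices
# each step; the remaining tokens are simply not processed.
import random

def patch_dropout(num_patches, drop_rate, rng):
    keep = max(1, int(num_patches * (1 - drop_rate)))
    return sorted(rng.sample(range(num_patches), keep))

rng = random.Random(0)
# ViT-B/32 at 224x224 resolution has (224 // 32) ** 2 = 49 patch tokens.
kept = patch_dropout(49, 0.5, rng)
print(len(kept))  # 24 of 49 tokens survive this step
```

Since self-attention cost grows with sequence length, halving the token count cuts per-step compute substantially, which is where the training speedup comes from.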

Usage

Set patch dropout in your model config or via command-line:
# Enable 50% patch dropout during training
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    --epochs 30 \
    # ... other arguments

Fine-tuning Without Patch Dropout

The paper recommends fine-tuning without patch dropout at the end:
# Phase 1: Train with patch dropout (fast)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    --epochs 28 \
    --name "phase1-with-dropout"

# Phase 2: Fine-tune without patch dropout (accuracy)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.0 \
    --resume /path/to/phase1/epoch_28.pt \
    --epochs 32 \
    --name "phase2-no-dropout"
Model Size   Patch Dropout   Training Speedup
ViT-B/32     0.5             ~2×
ViT-B/16     0.5             ~2×
ViT-L/14     0.5-0.75        ~2-3×
ViT-H/14     0.75            ~3×
Patch dropout only applies to Vision Transformer models. It has no effect on ResNet or ConvNext models.

Gradient Checkpointing

Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
python -m open_clip_train.main \
    --grad-checkpointing \
    # ... other arguments
Tradeoffs:
  • Memory: 30-50% reduction in activation memory
  • Speed: ~20% slower due to recomputation
  • Batch Size: Allows larger batch sizes
When to use:
  • Training large models (ViT-L, ViT-H, ViT-bigG)
  • Out of memory errors even with mixed precision
  • Prefer larger batch sizes over speed
Example:
# ViT-H-14 requires gradient checkpointing on A100-40GB
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-H-14 \
    --batch-size 64 \
    --precision amp \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad
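The recompute-instead-of-store idea can be illustrated with a toy chain of layers (plain Python, not `torch.utils.checkpoint`): only every k-th activation is kept, and intermediate activations are rebuilt from the nearest checkpoint when needed, as the backward pass would.

```python
# Toy illustration of gradient checkpointing's memory/compute tradeoff.
def forward(layers, x, checkpoint_every):
    stored = {0: x}            # activation checkpoints, keyed by layer index
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % checkpoint_every == 0:
            stored[i] = h      # keep only every k-th activation
    return h, stored

def recompute(layers, stored, target):
    # Rebuild the activation at `target` from the nearest earlier checkpoint,
    # paying extra forward compute instead of storing every activation.
    start = max(i for i in stored if i <= target)
    h = stored[start]
    for i in range(start, target):
        h = layers[i](h)
    return h

layers = [lambda v, k=k: v + k for k in range(1, 9)]  # 8 toy "layers"
out, stored = forward(layers, 0, checkpoint_every=4)
print(len(stored))                    # 3 stored activations instead of 9
print(recompute(layers, stored, 6))   # activation at layer 6, recomputed
```

Storing 3 activations instead of 9 is the memory win; the two extra layer evaluations in `recompute` are the ~20% speed cost.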

SyncBatchNorm

Synchronize batch normalization statistics across GPUs for models with BatchNorm layers.
python -m open_clip_train.main \
    --use-bn-sync \
    # ... other arguments (for models with BatchNorm)
When to use:
  • ResNet models with BatchNorm layers
  • Small batch sizes per GPU (<32)
Not needed for:
  • Vision Transformers (use LayerNorm)
  • Large batch sizes (>128 per GPU)

Combining Techniques

Small-Scale Training (1-4 GPUs)

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --precision amp \
    --model ViT-B-32
# Simple and effective

Medium-Scale Training (8-32 GPUs)

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 128 \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --model ViT-L-14
# Add distributed loss optimization

Large-Scale Training (64+ GPUs)

srun python -u src/open_clip_train/main.py \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --precision amp_bf16 \
    --grad-checkpointing \
    --force-patch-dropout 0.5 \
    --local-loss \
    --gather-with-grad \
    --model ViT-H-14
# All optimizations enabled

Very Large-Scale (256-1024 GPUs)

srun python -u src/open_clip_train/main.py \
    --train-data '/data/laion-2b/{00000..100000}.tar' \
    --train-num-samples 2000000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp_bf16 \
    --grad-checkpointing \
    --force-patch-dropout 0.75 \
    --local-loss \
    --gather-with-grad \
    --workers 12 \
    --model ViT-bigG-14 \
    --remote-sync s3://bucket/checkpoints \
    --delete-previous-checkpoint
# Maximum optimization for largest scale

DDP Static Graph

PyTorch 1.11+ supports static graph optimization for DistributedDataParallel:
python -m open_clip_train.main \
    --ddp-static-graph \
    # ... other arguments
Benefits:
  • Slightly faster gradient synchronization
  • Lower memory overhead
Requirements:
  • Model architecture doesn’t change during training
  • PyTorch >= 1.11

Torch Compile

PyTorch 2.0+ supports model compilation for faster execution:
python -m open_clip_train.main \
    --torchcompile \
    # ... other arguments
Benefits:
  • 10-30% speedup on A100/H100 GPUs
  • Automatic kernel fusion and optimization
Considerations:
  • First epoch is slower (compilation time)
  • Requires PyTorch >= 2.0
  • May have compatibility issues with some models
If using --grad-checkpointing with --torchcompile and DDP, the DDP dynamo optimizer is automatically disabled to avoid compatibility issues.

Int8 Training (Experimental)

OpenCLIP has beta support for int8 training using bitsandbytes:
pip install bitsandbytes triton

python -m open_clip_train.main \
    --use-bnb-linear SwitchBackLinearGlobal \
    # ... other arguments
Benefits:
  • ~10% training speedup for ViT-Huge
  • Reduced memory usage
  • No accuracy loss (preliminary results)
Status: Experimental, see tutorial

Performance Monitoring

GPU Utilization

Monitor GPU usage during training:
watch -n 1 nvidia-smi
Target: 90-100% GPU utilization.

If GPU utilization is low (<80%):
  1. Increase --workers (data loading parallelism)
  2. Use faster storage (NVMe SSD)
  3. Increase --batch-size if memory allows
  4. Profile data loading pipeline

Throughput Measurement

OpenCLIP logs samples/sec during training:
Train Epoch: 0 [1280/10968539 (0%)] Batch (t): 0.850, 1506/s, 376/s/gpu
  • 1506/s: Global samples per second (all GPUs)
  • 376/s/gpu: Samples per second per GPU
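A quick sanity check is to recover the implied GPU count from these two fields (the log format shown above may vary across OpenCLIP versions):

```python
# Parse the throughput fields from an OpenCLIP training log line and derive
# the number of GPUs contributing to the global rate.
import re

line = "Train Epoch: 0 [1280/10968539 (0%)] Batch (t): 0.850, 1506/s, 376/s/gpu"
m = re.search(r"([\d.]+)/s, ([\d.]+)/s/gpu", line)
global_sps = float(m.group(1))
per_gpu_sps = float(m.group(2))
num_gpus = round(global_sps / per_gpu_sps)
print(num_gpus)  # 1506 / 376 -> 4 GPUs
```

If the derived count does not match your actual GPU count, some ranks are likely idle or straggling.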
Typical values on A100 (40GB):
Model      Batch/GPU   Precision   Samples/sec/GPU
RN50       512         amp         ~1000
ViT-B/32   320         amp         ~600
ViT-L/14   128         amp         ~150
ViT-H/14   64          amp         ~50

Communication Overhead

For multi-node training, monitor network bandwidth:
iftop -i ib0  # InfiniBand
iftop -i eth0 # Ethernet
Expect: High bandwidth during gradient synchronization (every step)

Best Practices Summary

✅ Always Use

  1. Mixed Precision: --precision amp (or amp_bf16 on A100/H100)
  2. Distributed Loss: --local-loss --gather-with-grad (for 8+ GPUs)
  3. WebDataset: For datasets >10M samples

✅ Use When Needed

  1. Gradient Checkpointing: --grad-checkpointing (for large models)
  2. Patch Dropout: --force-patch-dropout 0.5 (for ViT models)
  3. Gradient Accumulation: --accum-freq N (when other options exhausted)

✅ Optimize For Your Setup

  1. Workers: --workers 4-12 (match to CPU cores per GPU)
  2. Batch Size: Maximize per GPU (limited by memory)
  3. Save Frequency: --save-frequency 1 (or less frequent for large models)

❌ Avoid

  • Pure FP16: Use --precision amp instead
  • Small batches: <64 per GPU reduces efficiency
  • Too many workers: >16 per GPU causes overhead
  • CSV format for large datasets: Use WebDataset

Next Steps

Multi-Node Training

Scale training across multiple machines with SLURM

Configuration

Explore all training configuration options

Data Preparation

Prepare datasets for efficient distributed training

Fine-tuning

Fine-tune pretrained models with distributed techniques
