
Overview

OpenCLIP provides advanced distributed training techniques that enable efficient training on hundreds or thousands of GPUs. This guide covers memory-efficient distributed loss computation, gradient accumulation, mixed precision training, and performance optimizations.

Memory-Efficient Distributed Loss

The standard CLIP contrastive loss requires computing a logit matrix of size (batch_size × num_gpus) × (batch_size × num_gpus), leading to O(n²) memory complexity. For large-scale training, this becomes a bottleneck.

The Problem: O(n²) Memory Complexity

Without optimization:
  • Global batch size: 256 samples/GPU × 128 GPUs = 32,768 samples
  • Logit matrix size: 32,768 × 32,768 = 1,073,741,824 elements
  • Memory (fp32): ~4 GB just for the logit matrix
This scales poorly and limits the number of GPUs you can use.
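The arithmetic above is easy to verify with a back-of-the-envelope sketch (plain Python, not OpenCLIP code):

```python
# Back-of-the-envelope memory estimate for the full contrastive logit matrix.
per_gpu_batch = 256
num_gpus = 128
bytes_per_fp32 = 4

global_batch = per_gpu_batch * num_gpus   # 32,768 samples
elements = global_batch ** 2              # full global logit matrix
gib = elements * bytes_per_fp32 / 2**30

print(f"{elements:,} elements, {gib:.1f} GiB in fp32")
# 1,073,741,824 elements, 4.0 GiB in fp32
```

Doubling the GPU count quadruples this matrix, which is why the naïve approach stops scaling.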

The Solution: Local Loss with Gradient Gathering

OpenCLIP implements an efficient distributed loss computation that achieves O(n) memory complexity while maintaining identical numerical results.
python -m open_clip_train.main \
    --local-loss \
    --gather-with-grad \
    # ... other arguments
With optimization:
  • Memory per GPU: O(local_batch_size × global_batch_size) instead of O(global_batch_size²), i.e. linear rather than quadratic in the number of GPUs
  • Numerical results: identical to the naïve all-gather approach

How It Works

  1. --local-loss: Compute loss locally with gathered features, avoiding full global logit matrix
  2. --gather-with-grad: Enable gradient flow through the feature gathering operation
Together, these flags enable linear memory scaling:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --local-loss \
    --gather-with-grad \
    --model ViT-B-32
Always use --local-loss and --gather-with-grad together for multi-node training (8+ GPUs). These flags are essential for scaling beyond small clusters.
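To see where the linear scaling comes from, here is a toy pure-Python sketch (illustrative only, not OpenCLIP's implementation): features are all-gathered, but each rank computes only its own rows of the similarity matrix against the gathered global features.

```python
# Toy sketch of local-loss logit computation. Each "rank" holds local_batch
# feature vectors; after the all-gather, every rank computes only its
# local_batch x global_batch slice of logits instead of the full matrix.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_logits(local_feats, gathered_feats):
    # local_batch x global_batch slice -- grows linearly with rank count.
    return [[dot(q, k) for k in gathered_feats] for q in local_feats]

ranks = 4
local_batch = 2
# Fake per-rank features (2-d vectors) and their all-gather.
per_rank = [[[r + 1.0, i + 1.0] for i in range(local_batch)] for r in range(ranks)]
gathered = [f for rank_feats in per_rank for f in rank_feats]

slices = [local_logits(per_rank[r], gathered) for r in range(ranks)]

logits_per_rank = len(slices[0]) * len(slices[0][0])  # 2 x 8 = 16 per rank
full_matrix = (ranks * local_batch) ** 2              # 8 x 8 = 64 total
```

Stacking the per-rank slices reproduces the full logit matrix row for row, which is why the loss values match the naïve approach exactly.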

Gradient Accumulation

Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights.

Basic Usage

python -m open_clip_train.main \
    --batch-size 128 \
    --accum-freq 4 \
    # ... other arguments
Effective batch size:
effective_batch = batch_size × num_gpus × accum_freq
                = 128 × 8 × 4
                = 4,096

When to Use Gradient Accumulation

Use --accum-freq when:
  1. ✅ You need larger effective batch sizes than GPU memory allows
  2. ✅ You want to maintain a specific batch size across different hardware
  3. ✅ You’re experimenting with very large batch sizes (>100k)
However, try these first:
  1. Enable --precision amp (mixed precision)
  2. Use --grad-checkpointing (memory-compute tradeoff)
  3. Use --local-loss --gather-with-grad (for distributed training)

Performance Implications

Speed: Samples/sec remains approximately constant
  • Without accumulation: Process 1024 samples in 1 step
  • With --accum-freq 4: Process 256 samples × 4 in 4 steps
  • Net throughput: Similar
Memory: Additional GPU memory required for:
  • Cached features from all accumulated batches
  • Multiple loss computations (one per accumulated batch)

Real-World Example

# Training ViT-L-14 on 8 A100 GPUs (40GB)
# Without gradient accumulation: OOM at batch_size 256
# With gradient accumulation: Works!

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/laion400m/{00000..41455}.tar' \
    --batch-size 128 \
    --accum-freq 2 \
    --precision amp \
    --grad-checkpointing \
    --model ViT-L-14 \
    --local-loss \
    --gather-with-grad
# Effective batch: 128 × 8 × 2 = 2,048

Implementation Details

Gradient accumulation in OpenCLIP:
  1. First N-1 steps: Forward pass with torch.no_grad(), cache features
  2. Nth step: Re-run forward passes with gradients enabled
  3. Backward: Compute gradients using cached features as negatives
  4. Step: Update optimizer
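OpenCLIP must cache features and re-run forward passes because the contrastive loss does not decompose per sample. For a loss that does average over samples, the core identity behind accumulation is exact, as this stdlib-only sketch shows:

```python
# Numerical sketch: for a loss that averages over samples, averaging the
# per-micro-batch gradients equals the full-batch gradient.
# Toy model: scalar w, loss = mean((w*x - y)^2), d(loss)/dw = mean(2x(w*x - y)).

def grad(w, xs, ys):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]

full = grad(w, xs, ys)            # one batch of 8 samples

accum_freq = 4
micro = 2
acc = 0.0
for i in range(accum_freq):       # 4 micro-batches of 2 samples
    acc += grad(w, xs[i*micro:(i+1)*micro], ys[i*micro:(i+1)*micro])
acc /= accum_freq

assert abs(acc - full) < 1e-12    # identical gradient, smaller peak memory
```

The contrastive loss couples every sample with every other, so OpenCLIP additionally needs the cached features from all micro-batches as negatives, per the steps above.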

Mixed Precision Training

Mixed precision training uses lower precision (float16 or bfloat16) for most computations while maintaining float32 for critical operations.

Automatic Mixed Precision (AMP)

# FP16 with automatic loss scaling (recommended)
python -m open_clip_train.main \
    --precision amp \
    # ... other arguments

# BFloat16 (if supported by hardware)
python -m open_clip_train.main \
    --precision amp_bf16 \
    # ... other arguments
Benefits:
  • 🚀 Speed: 2-3× faster training on modern GPUs (A100, H100)
  • 💾 Memory: ~50% reduction in activation memory
  • 📊 Accuracy: Negligible impact with automatic loss scaling
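Loss scaling matters because fp16 has a narrow dynamic range. A quick stdlib demonstration (Python's `struct` supports IEEE 754 binary16 via the `'e'` format):

```python
# Why AMP uses loss scaling: binary16 (fp16) has limited range, so very
# small gradient values underflow to zero unless the loss is scaled up.
import struct

def to_fp16(x):
    # Round-trip a Python float through binary16 ('e' struct format).
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))      # largest finite fp16 value, exactly representable
print(to_fp16(1e-8))         # underflows to 0.0 -- the gradient is lost
print(to_fp16(1e-8 * 1024))  # scaling by 2**10 keeps it representable
```

AMP multiplies the loss by a scale factor before backward, then divides gradients by it before the optimizer step, keeping small gradients out of the underflow region.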

Precision Options

Precision   Description                        Use Case
amp         Automatic Mixed Precision (FP16)   Recommended for most training
amp_bf16    AMP with BFloat16                  A100/H100 GPUs, more stable than FP16
fp32        Full 32-bit precision              Debugging, baseline comparison
fp16        Pure FP16 (not recommended)        Legacy, requires manual tuning
bf16        Pure BFloat16                      Experimental

Hardware Support

NVIDIA GPUs:
  • Volta (V100): FP16 via --precision amp
  • Ampere (A100): FP16 or BF16 via --precision amp or --precision amp_bf16
  • Hopper (H100): BF16 recommended via --precision amp_bf16
AMD GPUs:
  • MI250X: FP16 via --precision amp

Example: Mixed Precision Training

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 512 \
    --precision amp \
    --model ViT-B-32
# Batch size doubled thanks to reduced memory usage
Avoid using --precision fp16 (pure FP16) without automatic loss scaling. Use --precision amp instead, which handles loss scaling automatically.

Patch Dropout for Vision Transformers

Patch dropout randomly drops image patches during training, leading to 2-3× speedup for Vision Transformer models without accuracy loss.

Research Background

Li et al. 2022 showed that dropping 50-75% of visual tokens during training:
  • ✅ Speeds up training by 2-3×
  • ✅ Maintains final accuracy
  • ✅ Acts as a form of data augmentation
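The mechanism is simple: each training step, only a random subset of patch tokens is fed to the transformer. A toy sketch (not the OpenCLIP implementation):

```python
# Toy sketch of patch dropout: keep a random subset of patch token indices
# each step; the remaining tokens are simply not processed.
import random

def patch_dropout(num_patches, drop_rate, rng):
    keep = max(1, int(num_patches * (1 - drop_rate)))
    return sorted(rng.sample(range(num_patches), keep))

rng = random.Random(0)
# ViT-B/32 at 224x224 resolution has (224 // 32) ** 2 = 49 patch tokens.
kept = patch_dropout(49, 0.5, rng)
print(len(kept))  # 24 of 49 tokens survive this step
```

Since self-attention cost grows with sequence length, halving the token count cuts per-step compute substantially, which is where the training speedup comes from.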

Usage

Set patch dropout in your model config or via command-line:
# Enable 50% patch dropout during training
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    --epochs 30 \
    # ... other arguments

Fine-tuning Without Patch Dropout

The paper recommends fine-tuning without patch dropout at the end:
# Phase 1: Train with patch dropout (fast)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.5 \
    --epochs 28 \
    --name "phase1-with-dropout"

# Phase 2: Fine-tune without patch dropout (accuracy)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --force-patch-dropout 0.0 \
    --resume /path/to/phase1/epoch_28.pt \
    --epochs 32 \
    --name "phase2-no-dropout"
Model Size   Patch Dropout   Training Speedup
ViT-B/32     0.5             ~2×
ViT-B/16     0.5             ~2×
ViT-L/14     0.5-0.75        ~2-3×
ViT-H/14     0.75            ~3×
Patch dropout only applies to Vision Transformer models. It has no effect on ResNet or ConvNext models.

Gradient Checkpointing

Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
python -m open_clip_train.main \
    --grad-checkpointing \
    # ... other arguments
Tradeoffs:
  • Memory: 30-50% reduction in activation memory
  • Speed: ~20% slower due to recomputation
  • Batch Size: Allows larger batch sizes
When to use:
  • Training large models (ViT-L, ViT-H, ViT-bigG)
  • Out of memory errors even with mixed precision
  • Prefer larger batch sizes over speed
Example:
# ViT-H-14 requires gradient checkpointing on A100-40GB
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-H-14 \
    --batch-size 64 \
    --precision amp \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad
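The recompute-instead-of-store idea can be illustrated with a toy chain of layers (plain Python, not `torch.utils.checkpoint`): only every k-th activation is kept, and intermediate activations are rebuilt from the nearest checkpoint when needed, as the backward pass would.

```python
# Toy illustration of gradient checkpointing's memory/compute tradeoff.
def forward(layers, x, checkpoint_every):
    stored = {0: x}            # activation checkpoints, keyed by layer index
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % checkpoint_every == 0:
            stored[i] = h      # keep only every k-th activation
    return h, stored

def recompute(layers, stored, target):
    # Rebuild the activation at `target` from the nearest earlier checkpoint,
    # paying extra forward compute instead of storing every activation.
    start = max(i for i in stored if i <= target)
    h = stored[start]
    for i in range(start, target):
        h = layers[i](h)
    return h

layers = [lambda v, k=k: v + k for k in range(1, 9)]  # 8 toy "layers"
out, stored = forward(layers, 0, checkpoint_every=4)
print(len(stored))                    # 3 stored activations instead of 9
print(recompute(layers, stored, 6))   # activation at layer 6, recomputed
```

Storing 3 activations instead of 9 is the memory win; the two extra layer evaluations in `recompute` are the ~20% speed cost.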

SyncBatchNorm

Synchronize batch normalization statistics across GPUs for models with BatchNorm layers.
python -m open_clip_train.main \
    --use-bn-sync \
    # ... other arguments (for models with BatchNorm)
When to use:
  • ResNet models with BatchNorm layers
  • Small batch sizes per GPU (<32)
Not needed for:
  • Vision Transformers (use LayerNorm)
  • Large batch sizes (>128 per GPU)

Combining Techniques

Small-Scale Training (1-4 GPUs)

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --precision amp \
    --model ViT-B-32
# Simple and effective

Medium-Scale Training (8-32 GPUs)

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data '/data/train.tar' \
    --batch-size 128 \
    --precision amp \
    --local-loss \
    --gather-with-grad \
    --model ViT-L-14
# Add distributed loss optimization

Large-Scale Training (64+ GPUs)

srun python -u src/open_clip_train/main.py \
    --train-data '/data/train.tar' \
    --batch-size 256 \
    --precision amp_bf16 \
    --grad-checkpointing \
    --force-patch-dropout 0.5 \
    --local-loss \
    --gather-with-grad \
    --model ViT-H-14
# All optimizations enabled

Very Large-Scale (256-1024 GPUs)

srun python -u src/open_clip_train/main.py \
    --train-data '/data/laion-2b/{00000..100000}.tar' \
    --train-num-samples 2000000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp_bf16 \
    --grad-checkpointing \
    --force-patch-dropout 0.75 \
    --local-loss \
    --gather-with-grad \
    --workers 12 \
    --model ViT-bigG-14 \
    --remote-sync s3://bucket/checkpoints \
    --delete-previous-checkpoint
# Maximum optimization for largest scale

DDP Static Graph

PyTorch 1.11+ supports static graph optimization for DistributedDataParallel:
python -m open_clip_train.main \
    --ddp-static-graph \
    # ... other arguments
Benefits:
  • Slightly faster gradient synchronization
  • Lower memory overhead
Requirements:
  • Model architecture doesn’t change during training
  • PyTorch >= 1.11

Torch Compile

PyTorch 2.0+ supports model compilation for faster execution:
python -m open_clip_train.main \
    --torchcompile \
    # ... other arguments
Benefits:
  • 10-30% speedup on A100/H100 GPUs
  • Automatic kernel fusion and optimization
Considerations:
  • First epoch is slower (compilation time)
  • Requires PyTorch >= 2.0
  • May have compatibility issues with some models
If using --grad-checkpointing with --torchcompile and DDP, the DDP dynamo optimizer is automatically disabled to avoid compatibility issues.

Int8 Training (Experimental)

OpenCLIP has beta support for int8 training using bitsandbytes:
pip install bitsandbytes triton

python -m open_clip_train.main \
    --use-bnb-linear SwitchBackLinearGlobal \
    # ... other arguments
Benefits:
  • ~10% training speedup for ViT-Huge
  • Reduced memory usage
  • No accuracy loss (preliminary results)
Status: Experimental, see tutorial

Performance Monitoring

GPU Utilization

Monitor GPU usage during training:
watch -n 1 nvidia-smi
Target: 90-100% GPU utilization.

If GPU utilization is low (<80%):
  1. Increase --workers (data loading parallelism)
  2. Use faster storage (NVMe SSD)
  3. Increase --batch-size if memory allows
  4. Profile data loading pipeline

Throughput Measurement

OpenCLIP logs samples/sec during training:
Train Epoch: 0 [1280/10968539 (0%)] Batch (t): 0.850, 1506/s, 376/s/gpu
  • 1506/s: Global samples per second (all GPUs)
  • 376/s/gpu: Samples per second per GPU
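A quick sanity check is to recover the implied GPU count from these two fields (the log format shown above may vary across OpenCLIP versions):

```python
# Parse the throughput fields from an OpenCLIP training log line and derive
# the number of GPUs contributing to the global rate.
import re

line = "Train Epoch: 0 [1280/10968539 (0%)] Batch (t): 0.850, 1506/s, 376/s/gpu"
m = re.search(r"([\d.]+)/s, ([\d.]+)/s/gpu", line)
global_sps = float(m.group(1))
per_gpu_sps = float(m.group(2))
num_gpus = round(global_sps / per_gpu_sps)
print(num_gpus)  # 1506 / 376 -> 4 GPUs
```

If the derived count does not match your actual GPU count, some ranks are likely idle or straggling.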
Typical values on A100 (40GB):
Model      Batch/GPU   Precision   Samples/sec/GPU
RN50       512         amp         ~1000
ViT-B/32   320         amp         ~600
ViT-L/14   128         amp         ~150
ViT-H/14   64          amp         ~50

Communication Overhead

For multi-node training, monitor network bandwidth:
iftop -i ib0  # InfiniBand
iftop -i eth0 # Ethernet
Expect: High bandwidth during gradient synchronization (every step)

Best Practices Summary

✅ Always Use

  1. Mixed Precision: --precision amp (or amp_bf16 on A100/H100)
  2. Distributed Loss: --local-loss --gather-with-grad (for 8+ GPUs)
  3. WebDataset: For datasets >10M samples

✅ Use When Needed

  1. Gradient Checkpointing: --grad-checkpointing (for large models)
  2. Patch Dropout: --force-patch-dropout 0.5 (for ViT models)
  3. Gradient Accumulation: --accum-freq N (when other options exhausted)

✅ Optimize For Your Setup

  1. Workers: --workers 4-12 (match to CPU cores per GPU)
  2. Batch Size: Maximize per GPU (limited by memory)
  3. Save Frequency: --save-frequency 1 (or less frequent for large models)

❌ Avoid

  • Pure FP16: Use --precision amp instead
  • Small batches: <64 per GPU reduces efficiency
  • Too many workers: >16 per GPU causes overhead
  • CSV format for large datasets: Use WebDataset

Next Steps

Multi-Node Training

Scale training across multiple machines with SLURM

Configuration

Explore all training configuration options

Data Preparation

Prepare datasets for efficient distributed training

Fine-tuning

Fine-tune pretrained models with distributed techniques
