Overview
OpenCLIP provides advanced distributed training techniques that enable efficient training on hundreds or thousands of GPUs. This guide covers memory-efficient distributed loss computation, gradient accumulation, mixed precision training, and performance optimizations.

Memory-Efficient Distributed Loss
The standard CLIP contrastive loss requires computing a logit matrix of size (batch_size × num_gpus) × (batch_size × num_gpus), leading to O(n²) memory complexity. For large-scale training, this becomes a bottleneck.
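The O(n²) cost is easy to quantify with a little arithmetic. The batch size and GPU count below are illustrative, not OpenCLIP defaults:

```python
# Estimate the memory needed for the full global logit matrix in fp32
# (4 bytes per element). The matrix is (global_batch x global_batch).
def global_logit_matrix_bytes(batch_per_gpu: int, num_gpus: int,
                              bytes_per_elem: int = 4) -> int:
    n = batch_per_gpu * num_gpus      # global batch size
    return n * n * bytes_per_elem     # full (n x n) logit matrix

# 256 samples/GPU on 128 GPUs -> global batch of 32,768
mem = global_logit_matrix_bytes(256, 128)
print(f"{mem / 2**30:.1f} GiB")  # -> 4.0 GiB for the logit matrix alone
```

Doubling the global batch quadruples this figure, which is why the naive loss stops scaling.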
The Problem: O(n²) Memory Complexity
Without optimization, every GPU materializes the full global logit matrix.

The Solution: Local Loss with Gradient Gathering
OpenCLIP implements an efficient distributed loss computation that achieves O(n) memory complexity while maintaining identical numerical results.

How It Works
- `--local-loss`: Compute the loss locally with gathered features, avoiding the full global logit matrix
- `--gather-with-grad`: Enable gradient flow through the feature-gathering operation
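The memory saving can be checked with simple arithmetic. With local loss, each GPU only materializes logits for its own b samples against the gathered global batch, a (b × b·g) matrix instead of the full (b·g × b·g) one. Batch size and GPU count below are illustrative:

```python
b, g = 256, 128                   # batch per GPU, number of GPUs
n = b * g                         # global batch size (32,768)
bytes_per_elem = 4                # fp32 logits

full = n * n * bytes_per_elem     # standard loss: O(n^2) per GPU
local = b * n * bytes_per_elem    # --local-loss: O(b * n) per GPU

print(full // local)              # reduction factor equals the GPU count: 128
```

The per-GPU logit memory drops by exactly the number of GPUs, which is why this technique matters most at large world sizes.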
Gradient Accumulation
Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights.

Basic Usage
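As an arithmetic sketch of what `--accum-freq` changes (the values below are illustrative):

```python
# Effective (global) batch size under gradient accumulation:
# per-GPU batch x number of GPUs x accumulation frequency.
def effective_batch_size(batch_per_gpu: int, num_gpus: int, accum_freq: int) -> int:
    return batch_per_gpu * num_gpus * accum_freq

# 256 samples/GPU on 8 GPUs with --accum-freq 4 behaves like a batch of 8192
print(effective_batch_size(256, 8, 4))  # -> 8192
```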
When to Use Gradient Accumulation
Use `--accum-freq` when:
- ✅ You need larger effective batch sizes than GPU memory allows
- ✅ You want to maintain a specific batch size across different hardware
- ✅ You’re experimenting with very large batch sizes (>100k)
Before resorting to gradient accumulation:
- Enable `--precision amp` (mixed precision)
- Use `--grad-checkpointing` (memory-compute tradeoff)
- Use `--local-loss --gather-with-grad` (for distributed training)
Performance Implications
Speed: Samples/sec remains approximately constant
- Without accumulation: Process 1024 samples in 1 step
- With `--accum-freq 4`: Process 256 samples × 4 in 4 steps
- Net throughput: Similar
Memory: Slight overhead from:
- Cached features from all accumulated batches
- Multiple loss computations (one per accumulated batch)
Real-World Example
Implementation Details
Gradient accumulation in OpenCLIP:
- First N-1 steps: Forward pass with `torch.no_grad()`, cache features
- Nth step: Re-run forward passes with gradients enabled
- Backward: Compute gradients using cached features as negatives
- Step: Update the optimizer
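The schedule above can be sketched in plain Python. Here `forward` is a stand-in for the model's feature extractor; the real implementation caches image and text features under `torch.no_grad()` and replays each batch with gradients enabled:

```python
# Minimal sketch of the accumulation schedule (no autograd, just the control
# flow): cache all features first, then revisit each batch with the cached
# features from the other batches serving as extra contrastive negatives.
def accumulate(batches, forward):
    cached = [forward(b) for b in batches]       # pass 1: forward all, cache (no grad)
    steps = []
    for i, b in enumerate(batches):              # pass 2: re-forward with gradients
        feats = forward(b)
        negatives = cached[:i] + cached[i + 1:]  # other batches act as negatives
        steps.append((feats, negatives))
    return steps                                 # followed by one optimizer step

steps = accumulate([[1, 2], [3, 4]], forward=lambda b: [x * 2 for x in b])
print(len(steps))  # -> 2 backward passes before the single optimizer update
```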
Mixed Precision Training
Mixed precision training uses lower precision (float16 or bfloat16) for most computations while maintaining float32 for critical operations.

Automatic Mixed Precision (AMP)
- 🚀 Speed: 2-3× faster training on modern GPUs (A100, H100)
- 💾 Memory: ~50% reduction in activation memory
- 📊 Accuracy: Negligible impact with automatic loss scaling
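Why automatic loss scaling is needed can be seen by round-tripping values through IEEE half precision, which Python's `struct` module supports via the `'e'` format. This is purely illustrative; PyTorch's GradScaler handles the scaling automatically:

```python
import struct

# Round-trip a Python float through IEEE 754 half precision (fp16).
def to_fp16(x: float) -> float:
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(1e-8))          # -> 0.0: tiny gradients underflow to zero in fp16
print(to_fp16(1e-8 * 2**16))  # nonzero after scaling the loss by 2^16
print(to_fp16(2049.0))        # -> 2048.0: integers above 2048 lose precision
```

This is also why optimizer state and master weights stay in float32 under AMP.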
Precision Options
| Precision | Description | Use Case |
|---|---|---|
| `amp` | Automatic Mixed Precision (FP16) | Recommended for most training |
| `amp_bf16` | AMP with BFloat16 | A100/H100 GPUs, more stable than FP16 |
| `fp32` | Full 32-bit precision | Debugging, baseline comparison |
| `fp16` | Pure FP16 (not recommended) | Legacy, requires manual tuning |
| `bf16` | Pure BFloat16 | Experimental |
Hardware Support
NVIDIA GPUs:
- Volta (V100): FP16 via `--precision amp`
- Ampere (A100): FP16 or BF16 via `--precision amp` or `--precision amp_bf16`
- Hopper (H100): BF16 recommended via `--precision amp_bf16`

AMD GPUs:
- MI250X: FP16 via `--precision amp`
Example: Mixed Precision Training
Patch Dropout for Vision Transformers
Patch dropout randomly drops image patches during training, leading to a 2-3× speedup for Vision Transformer models without accuracy loss.

Research Background
Li et al. 2022 showed that dropping 50-75% of visual tokens during training:
- ✅ Speeds up training by 2-3×
- ✅ Maintains final accuracy
- ✅ Acts as a form of data augmentation
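The core idea can be sketched in a few lines. A ViT-B/16 on 224×224 images produces a 14×14 grid of 196 patch tokens; dropping half of them roughly halves the attention cost per layer. This is an illustrative stand-in, not OpenCLIP's implementation (which operates inside the vision tower and always keeps the class token):

```python
import random

# Keep a random subset of patch tokens each training step.
def patch_dropout(tokens, drop_rate):
    keep = max(1, int(len(tokens) * (1 - drop_rate)))
    return random.sample(tokens, keep)

patches = list(range(196))         # 14 x 14 patch grid for ViT-B/16
kept = patch_dropout(patches, 0.5)
print(len(kept))                   # -> 98 tokens reach the transformer
```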
Usage
Set patch dropout in your model config or via the command line.

Fine-tuning Without Patch Dropout
The paper recommends fine-tuning without patch dropout at the end of training.

Recommended Values
| Model Size | Patch Dropout | Training Speedup |
|---|---|---|
| ViT-B/32 | 0.5 | ~2× |
| ViT-B/16 | 0.5 | ~2× |
| ViT-L/14 | 0.5-0.75 | ~2-3× |
| ViT-H/14 | 0.75 | ~3× |
Patch dropout only applies to Vision Transformer models. It has no effect on ResNet or ConvNext models.
Gradient Checkpointing
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
- ✅ Memory: 30-50% reduction in activation memory
- ❌ Speed: ~20% slower due to recomputation
- ✅ Batch Size: Allows larger batch sizes
Use gradient checkpointing when:
- Training large models (ViT-L, ViT-H, ViT-bigG)
- You hit out-of-memory errors even with mixed precision
- You prefer larger batch sizes over speed
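A back-of-envelope model of the memory tradeoff (layer count and per-layer activation sizes below are made up for illustration, not measured):

```python
# Without checkpointing, activations for all L layers are stored for the
# backward pass. With it, roughly half are kept (matching the 30-50%
# reduction quoted above) and the rest are recomputed.
def activation_mem_gb(layers: int, per_layer_gb: float, checkpoint: bool) -> float:
    return layers * per_layer_gb * (0.5 if checkpoint else 1.0)

# hypothetical 24-layer model with 0.5 GB of activations per layer
print(activation_mem_gb(24, 0.5, checkpoint=False))  # -> 12.0 GB stored
print(activation_mem_gb(24, 0.5, checkpoint=True))   # -> 6.0 GB stored
```

The freed memory typically goes toward a larger per-GPU batch, which is why the ~20% slowdown is often a net win.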
SyncBatchNorm
Synchronize batch normalization statistics across GPUs for models with BatchNorm layers.

Use when:
- Training ResNet models with BatchNorm layers
- Per-GPU batch sizes are small (<32)

Avoid when:
- Training Vision Transformers (they use LayerNorm)
- Per-GPU batch sizes are large (>128)
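Why synchronization matters at small per-GPU batches: each GPU's local statistics can be far from the true global ones. A pure-Python illustration with two "GPUs" holding four made-up samples each:

```python
def mean(xs):
    return sum(xs) / len(xs)

gpu0 = [0.0, 1.0, 0.0, 1.0]
gpu1 = [9.0, 10.0, 9.0, 10.0]

local_means = [mean(gpu0), mean(gpu1)]  # [0.5, 9.5] -- what plain BatchNorm uses
global_mean = mean(gpu0 + gpu1)         # 5.0 -- what SyncBatchNorm computes
print(local_means, global_mean)
```

With large per-GPU batches the local statistics converge to the global ones, which is why SyncBatchNorm stops paying for its extra communication.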
Combining Techniques
Small-Scale Training (1-4 GPUs)
Medium-Scale Training (8-32 GPUs)
Large-Scale Training (64+ GPUs)
Very Large-Scale (256-1024 GPUs)
DDP Static Graph
PyTorch 1.11+ supports static graph optimization for DistributedDataParallel.

Benefits:
- Slightly faster gradient synchronization
- Lower memory overhead

Requirements:
- Model architecture doesn't change during training
- PyTorch >= 1.11
Torch Compile
PyTorch 2.0+ supports model compilation for faster execution:
- 10-30% speedup on A100/H100 GPUs
- Automatic kernel fusion and optimization
- First epoch is slower (compilation time)
- Requires PyTorch >= 2.0
- May have compatibility issues with some models
Int8 Training (Experimental)
OpenCLIP has beta support for int8 training using bitsandbytes:
- ~10% training speedup for ViT-Huge
- Reduced memory usage
- No accuracy loss (preliminary results)
Performance Monitoring
GPU Utilization
Monitor GPU usage during training. If utilization is low:
- Increase `--workers` (data loading parallelism)
- Use faster storage (NVMe SSD)
- Increase `--batch-size` if memory allows
- Profile the data loading pipeline
Throughput Measurement
OpenCLIP logs samples/sec during training:
- `1506/s`: Global samples per second (all GPUs)
- `376/s/gpu`: Samples per second per GPU
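The two numbers are related by the GPU count; the figures above are consistent with a 4-GPU run (expect small rounding differences in real logs):

```python
# Global throughput is per-GPU throughput times the number of GPUs.
def global_throughput(per_gpu: float, num_gpus: int) -> float:
    return per_gpu * num_gpus

print(global_throughput(376.5, 4))  # -> 1506.0 samples/sec across 4 GPUs
```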
| Model | Batch/GPU | Precision | Samples/sec/GPU |
|---|---|---|---|
| RN50 | 512 | amp | ~1000 |
| ViT-B/32 | 320 | amp | ~600 |
| ViT-L/14 | 128 | amp | ~150 |
| ViT-H/14 | 64 | amp | ~50 |
Communication Overhead
For multi-node training, monitor network bandwidth.

Best Practices Summary
✅ Always Use
- Mixed Precision: `--precision amp` (or `amp_bf16` on A100/H100)
- Distributed Loss: `--local-loss --gather-with-grad` (for 8+ GPUs)
- WebDataset: For datasets >10M samples
✅ Use When Needed
- Gradient Checkpointing: `--grad-checkpointing` (for large models)
- Patch Dropout: `--force-patch-dropout 0.5` (for ViT models)
- Gradient Accumulation: `--accum-freq N` (when other options are exhausted)
✅ Optimize For Your Setup
- Workers: `--workers 4-12` (match to CPU cores per GPU)
- Batch Size: Maximize per GPU (limited by memory)
- Save Frequency: `--save-frequency 1` (or less frequent for large models)
❌ Avoid
- Pure FP16: Use `--precision amp` instead
- Small batches: <64 per GPU reduces efficiency
- Too many workers: >16 per GPU causes overhead
- CSV format for large datasets: Use WebDataset
Next Steps
Multi-Node Training
Scale training across multiple machines with SLURM
Configuration
Explore all training configuration options
Data Preparation
Prepare datasets for efficient distributed training
Fine-tuning
Fine-tune pretrained models with distributed techniques
