Gradient accumulation is a technique that allows you to train with effectively larger batch sizes than your GPU memory would normally allow. This is crucial for CLIP training, where larger batch sizes typically lead to better performance.

Overview

Instead of updating model weights after every batch, gradient accumulation:
  1. Computes gradients for multiple small batches
  2. Accumulates (sums) these gradients
  3. Updates the model weights once after processing all accumulated batches
This simulates training with a batch size of batch_size × accum_freq × num_gpus.

Basic Usage

Use the --accum-freq flag to specify how many batches to accumulate:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --batch-size 128 \
    --accum-freq 4 \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --epochs 32
In this example:
  • Per-GPU batch size: 128
  • Accumulation frequency: 4
  • Effective batch size per GPU: 128 × 4 = 512
  • With 8 GPUs: Total effective batch size = 512 × 8 = 4,096

How It Works

Gradient accumulation modifies the training loop:

Without Gradient Accumulation (accum-freq = 1)

for batch in dataloader:
    outputs = model(batch)  # Forward pass
    loss = criterion(outputs)
    loss.backward()          # Compute gradients
    optimizer.step()         # Update weights
    optimizer.zero_grad()    # Clear gradients

With Gradient Accumulation (accum-freq = 4)

for i, batch in enumerate(dataloader):
    outputs = model(batch)            # Forward pass
    loss = criterion(outputs)
    (loss / accum_freq).backward()    # Accumulate scaled gradients

    if (i + 1) % accum_freq == 0:
        optimizer.step()              # Update weights
        optimizer.zero_grad()         # Clear gradients
Note the division by accum_freq: it makes the accumulated gradient the mean over the effective batch rather than the sum, matching the behavior of a single large batch.

Effective Batch Size Calculation

The effective batch size is:
Effective Batch Size = batch_size × accum_freq × num_gpus
Examples:
| Per-GPU Batch | Accum Freq | GPUs | Effective Batch Size |
|---------------|------------|------|----------------------|
| 128           | 1          | 8    | 1,024                |
| 128           | 2          | 8    | 2,048                |
| 128           | 4          | 8    | 4,096                |
| 64            | 8          | 8    | 4,096                |
| 256           | 1          | 4    | 1,024                |
| 256           | 4          | 4    | 4,096                |
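The formula is simple enough to express as a one-line helper; the function name below is ours, not part of open_clip:

```python
def effective_batch_size(batch_size: int, accum_freq: int, num_gpus: int) -> int:
    """Global batch size simulated per optimizer step."""
    return batch_size * accum_freq * num_gpus
```

Each row of the table above is just one evaluation of this function.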

Memory vs Speed Tradeoffs

Memory Considerations

Advantages:
  • Reduces per-step memory usage for model activations
  • Enables training larger models on limited hardware
  • Allows simulation of large batch sizes
Costs:
  • Features from all accumulated batches are cached in memory until the update step
  • Additional memory is needed for intermediate loss computations

Speed Considerations

Impact on Training Speed:
  • ~2× forward passes per sample (one without gradients to cache features, one with gradients)
  • Time per optimizer step increases proportionally with accum_freq
  • Overall throughput (samples/second) stays roughly constant
Example Performance:
accum-freq=1: 100 steps/epoch, 1000 samples/sec
accum-freq=4: 25 steps/epoch, 1000 samples/sec
Note: You process the same data but with fewer parameter updates.
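Those example numbers are consistent with a 100,000-sample epoch and 1,000 samples per micro-batch, both hypothetical figures chosen to match the rates above. A quick check:

```python
def steps_per_epoch(num_samples: int, batch_size: int, accum_freq: int,
                    world_size: int = 1) -> int:
    # One optimizer step consumes batch_size * accum_freq * world_size samples,
    # so raising accum_freq divides the number of updates per epoch.
    return num_samples // (batch_size * accum_freq * world_size)
```

With the same data and throughput, only the update count changes: 100 steps at accum_freq=1 versus 25 at accum_freq=4.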

When to Use Gradient Accumulation

Use Gradient Accumulation When:

  1. GPU Memory is Limited
    • Cannot fit desired batch size in memory
    • Training large models (ViT-L, ViT-H, ViT-g)
    • Using high-resolution images
  2. Constrained GPU Resources
    • Limited number of GPUs available
    • Need to match batch sizes from papers
    • Simulating larger-scale training
  3. After Trying Other Techniques
    • Already using --grad-checkpointing
    • Already using --local-loss --gather-with-grad
    • Already optimized per-GPU batch size

Avoid When:

  1. Memory is Sufficient: If you can fit larger batches, do so directly
  2. Using Distillation: Distillation requires --accum-freq 1
  3. Training is Already Slow: Gradient accumulation adds overhead
Follow this sequence to optimize batch size:
# 1. Start with largest possible batch size
--batch-size 512

# 2. If OOM, enable memory-saving features
--batch-size 512 \
--grad-checkpointing \
--local-loss \
--gather-with-grad

# 3. If still OOM, reduce batch size
--batch-size 256 \
--grad-checkpointing \
--local-loss \
--gather-with-grad

# 4. Finally, use gradient accumulation to increase effective batch size
--batch-size 256 \
--accum-freq 2 \
--grad-checkpointing \
--local-loss \
--gather-with-grad

Examples

Single GPU Training

Simulate a large batch size on a single GPU:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --batch-size 64 \
    --accum-freq 16 \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --precision amp \
    --workers 4 \
    --epochs 32
Effective batch size: 64 × 16 = 1,024

Multi-GPU Training

Scale to very large batch sizes:
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --batch-size 128 \
    --accum-freq 4 \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --dataset-type webdataset \
    --precision amp \
    --workers 8 \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --epochs 32
Effective batch size: 128 × 4 × 8 = 4,096

Large Model Training

Train huge models with gradient accumulation:
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-H-14 \
    --batch-size 32 \
    --accum-freq 8 \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --precision amp \
    --workers 8 \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --epochs 32 \
    --lr 1e-3 \
    --warmup 2000
Effective batch size: 32 × 8 × 8 = 2,048

High Resolution Images

Train with larger image sizes:
python -m open_clip_train.main \
    --model ViT-L-14-336 \
    --batch-size 64 \
    --accum-freq 4 \
    --force-image-size 336 \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --precision amp \
    --grad-checkpointing \
    --epochs 32
Effective batch size: 64 × 4 = 256

Learning Rate Adjustment

What matters for choosing a learning rate is the effective batch size, not how it is composed. Generally, no learning rate adjustment is needed when changing --accum-freq, as long as the effective batch size stays the same. If you are matching a training recipe that used a larger per-GPU batch, keep that recipe's learning rate:
# Original: batch_size=512, lr=1e-3
# New: batch_size=128, accum_freq=4 (same effective batch)
# Keep the same learning rate
--batch-size 128 \
--accum-freq 4 \
--lr 1e-3

Implementation Details

Forward Passes

With gradient accumulation, each sample is forwarded twice:
  1. First pass (with torch.no_grad()): computes and caches features for every accumulated batch
  2. Second pass (with gradients): re-computes the current batch's features so the loss can be taken over the full accumulated feature set
This is necessary because CLIP's contrastive objective needs every sample in the effective batch to serve as a negative for every other sample.
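The two-pass scheme can be sketched in a few lines of PyTorch. This is a simplified illustration, not open_clip's actual implementation: contrastive_loss, accum_step, the (image, text) feature tuple returned by model, and equal-sized micro-batches are all assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    # Symmetric InfoNCE over the full accumulated feature set.
    logits = img_feats @ txt_feats.t() / temperature
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def accum_step(model, batches, optimizer):
    # Pass 1: cache features for every micro-batch without gradients.
    with torch.no_grad():
        cached = [model(b) for b in batches]  # list of (img, txt) pairs
    img_all = torch.cat([c[0] for c in cached])
    txt_all = torch.cat([c[1] for c in cached])

    optimizer.zero_grad()
    bs = batches[0].size(0)
    for i, b in enumerate(batches):
        # Pass 2: re-forward micro-batch i WITH gradients and splice its
        # fresh features into the cached set before computing the loss.
        img_i, txt_i = model(b)
        img = torch.cat([img_all[:i * bs], img_i, img_all[(i + 1) * bs:]])
        txt = torch.cat([txt_all[:i * bs], txt_i, txt_all[(i + 1) * bs:]])
        # Scale by the number of micro-batches so the accumulated
        # gradient matches a single large-batch update.
        loss = contrastive_loss(img, txt) / len(batches)
        loss.backward()  # gradients accumulate across micro-batches
    optimizer.step()
```

Only the current micro-batch's activations live in memory during the gradient pass; the other features are detached caches, which is where the memory savings come from.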

Loss Computation

The loss is computed accum_freq times before each weight update:
  • Each accumulated batch computes its own loss
  • Gradients are accumulated across all batches
  • Final gradient is the sum (effectively the mean due to normalization)
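The sum-equals-mean relationship is easy to verify with made-up numbers (equal-sized micro-batches assumed; the losses below are arbitrary):

```python
# Hypothetical per-sample losses, split into accum_freq micro-batches.
micro_batches = [[0.9, 1.1], [0.5, 0.7], [1.3, 0.3], [0.8, 0.4]]
accum_freq = len(micro_batches)

# Each micro-batch contributes mean(batch) / accum_freq; summing these ...
accumulated = sum(sum(b) / len(b) / accum_freq for b in micro_batches)

# ... matches the mean loss over the whole effective batch.
flat = [x for b in micro_batches for x in b]
global_mean = sum(flat) / len(flat)
assert abs(accumulated - global_mean) < 1e-9
```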

Memory Usage

Memory is used for:
  • Model weights and optimizer states
  • Gradients (accumulated across batches)
  • Features from all accum_freq batches
  • Current batch activations

Monitoring Training

Key metrics when using gradient accumulation:
# Samples per second remains constant
samples_per_second = accum_freq * batch_size * world_size / batch_time

# Steps per epoch decreases
steps_per_epoch = num_samples / (batch_size * accum_freq * world_size)

# Total samples seen is unchanged
total_samples = steps * batch_size * accum_freq * world_size

Compatibility

Works With:

  • Mixed precision training (--precision amp)
  • Gradient checkpointing (--grad-checkpointing)
  • Local loss (--local-loss)
  • Gather with gradients (--gather-with-grad)
  • Distributed training (multi-GPU)
  • All model architectures

Does Not Work With:

  • Model distillation (--distill-model) - requires --accum-freq 1

Best Practices

  1. Start Small: Test with --accum-freq 2 before using larger values
  2. Powers of 2: Use accum_freq values of 2, 4, or 8 to keep effective batch sizes at familiar round numbers
  3. Balance: Find the sweet spot between batch_size and accum_freq
  4. Memory First: Maximize batch_size before increasing accum_freq
  5. Monitor: Watch memory usage and training speed to find optimal settings
  6. Document: Record your effective batch size for reproducibility

Troubleshooting

Still Running Out of Memory

# Reduce batch size further
--batch-size 64 --accum-freq 8

# Enable gradient checkpointing
--grad-checkpointing

# Use lower precision
--precision fp16

# Reduce number of workers
--workers 4

Training is Too Slow

# Reduce accum_freq if possible
--accum-freq 2  # instead of 4

# Increase batch size
--batch-size 256  # instead of 128

# Enable amp for mixed precision
--precision amp

Unstable Training

# Increase warmup
--warmup 5000

# Adjust learning rate
--lr 5e-4  # reduce if unstable

# Add gradient clipping
--grad-clip-norm 1.0
