
Overview

Multi-node training allows you to scale CLIP training across multiple machines, enabling training on massive datasets with large models. OpenCLIP supports multi-node training through both native PyTorch distributed (torchrun) and SLURM cluster management.
OpenCLIP has been battle-tested on clusters with up to 1024 A100 GPUs, demonstrating robust scalability for large-scale training.

Prerequisites

  • Multiple machines with GPUs connected via high-bandwidth network
  • Network configuration allowing inter-node communication
  • Shared filesystem accessible from all nodes (recommended)
  • SLURM cluster (for SLURM-based training) or manual node coordination

Multi-Node with torchrun

Basic Setup

The torchrun launcher supports multi-node training with minimal configuration. The key is specifying the master node’s address and the number of nodes.
cd open_clip/src
torchrun --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=$NODE_RANK \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/ \
    --model ViT-B-32 \
    --epochs 32
Key Parameters:
  • --nproc_per_node=4: Number of GPUs per node (4 in this example)
  • --nnodes=2: Total number of nodes
  • --node_rank=$NODE_RANK: Rank of current node (0 for master, 1, 2, … for workers)
  • --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT: Address of the master node
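As a sanity check before launching, note that the total number of training processes (the world size) is nnodes × nproc_per_node. A quick sketch using the values from this example:

```shell
# World size = nodes × processes per node (values from the example above)
NNODES=2
NPROC_PER_NODE=4
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
echo "Expected world size: $WORLD_SIZE GPU processes"  # 8
```

Each of the 8 processes drives one GPU; torchrun assigns it a unique global rank from 0 to WORLD_SIZE − 1.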

Environment Variables

Set these environment variables on each node.
Master Node (Node 0):
export MASTER_ADDR=$(hostname -i)  # IP address of master node
export MASTER_PORT=29500           # Communication port
export NODE_RANK=0                 # Master node rank
Worker Nodes (Node 1, 2, …):
export MASTER_ADDR=<master-node-ip>  # IP of master node
export MASTER_PORT=29500              # Same port as master
export NODE_RANK=1                    # 1, 2, 3, ... for each worker

Complete Multi-Node Example

Here’s a complete example with 2 nodes, 4 GPUs each.
On Master Node (192.168.1.10):
#!/bin/bash

export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export NODE_RANK=0

cd open_clip/src
torchrun --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=$NODE_RANK \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m open_clip_train.main \
    --train-data '/shared/data/laion400m/{00000..41455}.tar' \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp \
    --workers 6 \
    --warmup 2000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-B-32 \
    --name "vit-b32-laion400m" \
    --local-loss \
    --gather-with-grad \
    --report-to wandb
On Worker Node (192.168.1.11):
#!/bin/bash

export MASTER_ADDR=192.168.1.10  # Master node IP
export MASTER_PORT=29500
export NODE_RANK=1               # Worker rank

cd open_clip/src
torchrun --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=$NODE_RANK \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m open_clip_train.main \
    # ... same arguments as master node ...

SLURM-Based Training

SLURM is the recommended approach for large-scale cluster training. It automatically handles node allocation, environment setup, and process launching.

Basic SLURM Script

#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition=PARTITION_NAME

eval "$(/path/to/conda/bin/conda shell.bash hook)" # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data "/data/LAION-400M/{00000..41455}.tar" \
    --warmup 2000 \
    --batch-size 256 \
    --epochs 32 \
    --workers 8 \
    --model ViT-B-32 \
    --name "ViT-B-32-Vanilla" \
    --seed 0 \
    --local-loss \
    --gather-with-grad
SLURM Parameters:
  • --nodes=32: Number of nodes to allocate
  • --gres=gpu:4: Request 4 GPUs per node
  • --ntasks-per-node=4: Launch 4 tasks (1 per GPU) per node
  • --cpus-per-task=6: Allocate 6 CPU cores per task (for data loading)
  • --wait-all-nodes=1: Wait for all nodes to be ready before starting
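The directives above determine the job's total GPU count. A quick check of the arithmetic, using the values from this script:

```shell
# Total GPUs = --nodes × --ntasks-per-node (one task per GPU)
NODES=32
TASKS_PER_NODE=4
echo "Total GPUs: $((NODES * TASKS_PER_NODE))"  # 128
```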

Production SLURM Example

Here’s a production-ready SLURM script for training ViT-L/14 on LAION-400M:
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=clip_vit_l14
#SBATCH --time=7-00:00:00
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err

# Initialize conda
eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip

# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export OMP_NUM_THREADS=1

# Get master node address
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

echo "Master address: $MASTER_ADDR"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"

# Change to project directory
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"

# Launch training
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --report-to wandb \
    --wandb-project-name "clip-laion400m" \
    --train-data "/data/LAION-400M/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --warmup 10000 \
    --batch-size 128 \
    --epochs 32 \
    --workers 8 \
    --model ViT-L-14 \
    --precision amp \
    --grad-checkpointing \
    --lr 5e-4 \
    --wd 0.2 \
    --name "vit-l14-laion400m" \
    --seed 42 \
    --local-loss \
    --gather-with-grad \
    --imagenet-val /data/imagenet/validation/

Submitting SLURM Jobs

# Submit job
sbatch train_clip.sh

# Check job status
squeue -u $USER

# Monitor output
tail -f logs/slurm-<job-id>.out

# Cancel job
scancel <job-id>

Network Configuration

Firewall Settings

Ensure communication ports are open between nodes:
# Allow PyTorch distributed communication (example for firewalld)
sudo firewall-cmd --zone=public --add-port=29500/tcp --permanent
sudo firewall-cmd --zone=public --add-port=29400-29600/tcp --permanent
sudo firewall-cmd --reload

Network Backend

Configure the distributed backend for your hardware.
NVIDIA GPUs with NCCL (recommended):
# Default - no additional flags needed
# Uses NCCL automatically for GPU training
Ascend NPU:
--dist-backend hccl
CPU-only:
--dist-backend gloo

InfiniBand Optimization

For clusters with InfiniBand, optimize NCCL settings:
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO  # For debugging

Distributed Training Optimizations

Memory-Efficient Distributed Loss

For multi-node training, use these flags to reduce memory usage from O(n²) to O(n), where n is the global batch size:
--local-loss \
--gather-with-grad
Without these flags:
  • Memory usage: O((batch_size × num_gpus)²)
  • Example: 256 batch size × 128 GPUs = 8GB+ logit matrix
With these flags:
  • Memory usage: O(batch_size × num_gpus)
  • Same numerical results
  • Essential for large-scale training (64+ GPUs)
See Distributed Training for detailed explanation.
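To see where the quadratic cost comes from, here is a rough back-of-the-envelope estimate for the example above (fp32 logits, counting both the image-to-text and text-to-image matrices; the numbers are illustrative):

```shell
# Global batch = per-GPU batch × number of GPUs
BATCH=256
GPUS=128
GLOBAL=$((BATCH * GPUS))                        # 32768
# Full logit matrix: GLOBAL × GLOBAL fp32 entries, ×2 for both loss directions
BYTES=$((GLOBAL * GLOBAL * 4 * 2))
echo "Logit memory without --local-loss: $((BYTES / 1024 / 1024 / 1024)) GB"  # 8 GB
```

With --local-loss, each GPU only materializes its local slice of the logit matrix, so per-GPU memory grows linearly rather than quadratically as GPUs are added.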

Gradient Accumulation

Simulate larger batch sizes across nodes:
--accum-freq 4  # Accumulate gradients over 4 steps
Effective batch size:
effective_batch = batch_size × num_gpus × num_nodes × accum_freq
              = 128 × 4 × 32 × 4
              = 65,536

Remote Checkpoint Syncing

For multi-node training, sync checkpoints to remote storage (S3, shared filesystem):
python -u src/open_clip_train/main.py \
    --logs /local/scratch/logs \
    --remote-sync s3://my-bucket/checkpoints \
    --remote-sync-frequency 300 \
    --delete-previous-checkpoint \
    # ... other arguments
Parameters:
  • --logs: Local checkpoint directory
  • --remote-sync: Remote path (s3:// or shared filesystem path)
  • --remote-sync-frequency 300: Sync every 300 seconds (5 minutes)
  • --delete-previous-checkpoint: Save disk space on local nodes

Resume from Remote Checkpoint

--resume s3://my-bucket/checkpoints/experiment/checkpoints/epoch_10.pt

SLURM Job Management

Interactive SLURM Session

For debugging, request an interactive session:
srun --nodes=2 --gres=gpu:4 --ntasks-per-node=4 --pty bash

# Then run training commands manually

Monitor Job Progress

# Watch job queue
watch -n 5 squeue -u $USER

# Check node allocation
scontrol show job <job-id>

# View GPU usage on allocated nodes
srun --jobid=<job-id> --overlap nvidia-smi

Hyperparameter Sweeps with Job Arrays

Use a SLURM job array to run several configurations in parallel, e.g. a learning-rate sweep:
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --array=0-4

# Different learning rates for each job
LRS=(1e-3 5e-4 1e-4 5e-5 1e-5)
LR=${LRS[$SLURM_ARRAY_TASK_ID]}

srun python -u src/open_clip_train/main.py \
    --lr $LR \
    --name "experiment-lr-$LR" \
    # ... other arguments

Troubleshooting Multi-Node Training

Nodes Can’t Communicate

Symptom: Training hangs at initialization
Solutions:
  1. Check firewall settings
  2. Verify MASTER_ADDR is reachable from all nodes:
    ping $MASTER_ADDR
    telnet $MASTER_ADDR $MASTER_PORT
    
  3. Check SLURM node allocation:
    scontrol show job $SLURM_JOB_ID
    

NCCL Initialization Errors

Symptom:
NCCL error: unhandled system error
Solutions:
  1. Enable NCCL debugging:
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=ALL
    
  2. Check GPU visibility:
    nvidia-smi
    
  3. Verify InfiniBand configuration (if applicable)

Inconsistent Results Across Nodes

Symptom: Different nodes show different loss values
Solutions:
  1. Ensure same code version on all nodes
  2. Check data is accessible from all nodes
  3. Verify --seed is set for reproducibility
  4. Use --wait-all-nodes=1 in SLURM

Out of Memory on Some Nodes

Symptom: OOM error on specific nodes
Solutions:
  1. Check GPU memory is equal across nodes:
    srun nvidia-smi
    
  2. Use --grad-checkpointing for memory efficiency
  3. Reduce --batch-size per GPU
  4. Enable --local-loss --gather-with-grad

Performance Optimization

Network Bandwidth

Monitor network usage during training:
# On each node
iftop -i ib0  # For InfiniBand
iftop -i eth0 # For Ethernet
Target: High utilization during gradient synchronization

Scaling Efficiency

Measure scaling efficiency:
# Ideal scaling: 2x nodes = 2x throughput
scaling_efficiency = (throughput_N_nodes / throughput_1_node) / N_nodes

# Target: > 0.9 for good scaling
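The efficiency formula can be evaluated directly in the shell. The throughput figures below are illustrative placeholders, not measured numbers:

```shell
# Scaling efficiency from measured throughput (samples/sec; illustrative values)
THROUGHPUT_1NODE=1000
THROUGHPUT_8NODES=7200
NODES=8
# awk handles the floating-point division
awk -v t1="$THROUGHPUT_1NODE" -v tn="$THROUGHPUT_8NODES" -v n="$NODES" \
    'BEGIN { printf "Scaling efficiency: %.2f\n", (tn / t1) / n }'  # 0.90
```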
Tips for better scaling:
  • Use --local-loss --gather-with-grad
  • Ensure sufficient batch size per GPU (128-512)
  • Use WebDataset format
  • Optimize --workers for data loading

Benchmark Multi-Node Performance

# Run with different node counts and compare throughput
for NODES in 1 2 4 8 16; do
    sbatch --nodes=$NODES benchmark.sh
done

Example: Large-Scale Training Configuration

Training ViT-H/14 on LAION-2B with 256 GPUs (64 nodes × 4 GPUs):
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --wait-all-nodes=1
#SBATCH --time=14-00:00:00

eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip

export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export NCCL_IB_DISABLE=0

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"

srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --train-data "/data/LAION-2B/{00000..100000}.tar" \
    --train-num-samples 2000000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 256 \
    --precision amp \
    --grad-checkpointing \
    --workers 12 \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-H-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb \
    --remote-sync s3://my-bucket/clip-checkpoints \
    --remote-sync-frequency 600 \
    --delete-previous-checkpoint

Next Steps

Distributed Training

Learn about advanced distributed training techniques

Configuration

Explore all training configuration options

Single-Node Training

Start with single-node training before scaling

Data Preparation

Prepare large-scale datasets for multi-node training
