Multi-node distributed training scales fine-tuning across multiple machines, letting you train larger models, cut training time, and handle massive datasets that don’t fit on a single machine.

Overview

Multi-node training distributes the workload across multiple servers, each with multiple GPUs:
  • Scale to larger models: Train Qwen-72B or larger on multiple machines
  • Faster training: Reduce training time through data parallelism
  • Handle large datasets: Distribute data across nodes
  • Cost efficiency: Use available cluster resources effectively
Multi-node training works with all fine-tuning methods: full-parameter, LoRA, and Q-LoRA.

Architecture

Single Node vs Multi-node

Single Node (1 machine with 4 GPUs):
┌────────────────────────────┐
│  Node 0 (master)           │
│  ┌──────────────────────┐  │
│  │ GPU0 GPU1 GPU2 GPU3  │  │
│  └──────────────────────┘  │
└────────────────────────────┘

Multi-node (2 machines with 4 GPUs each):
┌────────────────────────────┐     ┌────────────────────────────┐
│  Node 0 (master)           │     │  Node 1 (worker)           │
│  ┌──────────────────────┐  │     │  ┌──────────────────────┐  │
│  │ GPU0 GPU1 GPU2 GPU3  │  │ ↔↔↔ │  │ GPU4 GPU5 GPU6 GPU7  │  │
│  └──────────────────────┘  │     │  └──────────────────────┘  │
└────────────────────────────┘     └────────────────────────────┘
              High-speed interconnect (InfiniBand/RoCE)
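Each GPU process ends up with a local rank on its node and a unique global rank across the cluster; torchrun derives both from NNODES, NODE_RANK, and the per-node GPU count. The arithmetic, sketched for the 2-node diagram (illustrative, not part of the training scripts):

```shell
# Rank arithmetic for a 2-node, 4-GPU-per-node cluster
NNODES=2
GPUS_PER_NODE=4
NODE_RANK=1     # second machine
LOCAL_RANK=2    # third GPU on that machine; torchrun sets this per process

WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
GLOBAL_RANK=$((NODE_RANK * GPUS_PER_NODE + LOCAL_RANK))
echo "world_size=$WORLD_SIZE global_rank=$GLOBAL_RANK"  # GPU6 in the diagram
```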

Prerequisites

Hardware Requirements

  • Network: High-speed interconnect required
    • Recommended: InfiniBand (100+ Gbps) or RoCE
    • Minimum: 10GbE Ethernet (performance degradation expected)
  • GPUs: Same GPU types across all nodes for best results
  • Storage: Shared filesystem (NFS) or synchronized data on each node

Software Requirements

# Install on all nodes
pip install -r requirements.txt
pip install deepspeed
pip install "peft<0.8.0"  # For LoRA/Q-LoRA

Network Configuration

Step 1: Ensure SSH Connectivity

All nodes must be able to SSH to each other without a password:
# On master node
ssh-keygen -t rsa
ssh-copy-id user@worker-node-1
ssh-copy-id user@worker-node-2

# Test connectivity
ssh user@worker-node-1 "hostname"
ssh user@worker-node-2 "hostname"
Step 2: Configure Firewall

Open required ports:
# Master port (default 6001)
sudo ufw allow 6001/tcp

# NCCL ports (for GPU communication)
sudo ufw allow 1024:65535/tcp
Step 3: Verify Network Speed

Test inter-node bandwidth:
# Install iperf3
sudo apt-get install iperf3

# On worker node
iperf3 -s

# On master node
iperf3 -c worker-node-1
# Should show: 10+ Gbps for acceptable performance
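To see why bandwidth matters, a rough back-of-the-envelope: a full bf16 gradient for a 7B-parameter model is about 14 GB, and moving it over a 10 Gbps link takes an order of magnitude longer than over 100 Gbps. Real all-reduce traffic depends on the algorithm and ZeRO stage, so treat this strictly as an illustration:

```shell
# Back-of-envelope: naive transfer time for one full bf16 gradient (7B params)
# Actual all-reduce traffic differs; this only illustrates why bandwidth matters.
awk 'BEGIN {
  bytes = 7e9 * 2               # 7B parameters x 2 bytes (bf16)
  for (gbps = 10; gbps <= 100; gbps *= 10)
    printf "%3d Gbps: %.1f s per naive transfer\n", gbps, bytes * 8 / (gbps * 1e9)
}'
# ~11 s at 10 Gbps vs ~1 s at 100 Gbps
```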

Multi-node Configuration

Environment Variables

Each node needs these environment variables set:
| Variable | Type | Required | Description |
|---|---|---|---|
| NNODES | int | yes | Total number of nodes (machines) in the training cluster. |
| NODE_RANK | int | yes | Rank of the current node. The master node is 0; workers are 1, 2, 3, … |
| MASTER_ADDR | string | yes | IP address or hostname of the master node (rank 0). |
| MASTER_PORT | int | yes | Port for communication between nodes (default: 6001). |
| GPUS_PER_NODE | int | yes | Number of GPUs on each node. Must be the same across all nodes. |
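Before launching, a small wrapper can sanity-check these variables. A sketch with example values for a 2-node, 4-GPU-per-node cluster (the IP is a placeholder):

```shell
# Example values (illustrative) for a 2-node, 4-GPU-per-node cluster
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
export GPUS_PER_NODE=4

# Fail fast on obvious misconfiguration before torchrun starts
if [ "$NODE_RANK" -ge "$NNODES" ]; then
  echo "NODE_RANK ($NODE_RANK) must be less than NNODES ($NNODES)" >&2
  exit 1
fi
echo "This node contributes $GPUS_PER_NODE of $((NNODES * GPUS_PER_NODE)) total GPUs"
```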

Multi-node LoRA Training

Training Script

The finetune_lora_ds.sh script supports multi-node training:
finetune/finetune_lora_ds.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

# Multi-node configuration
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}           # Total number of nodes
NODE_RANK=${NODE_RANK:-0}     # This node's rank (0, 1, 2, ...)
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B-Chat"
DATA="path_to_data.json"
DS_CONFIG_PATH="finetune/ds_config_zero2.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 3e-4 \
    --weight_decay 0.1 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --model_max_length 512 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --lazy_preprocess True \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed ${DS_CONFIG_PATH}

Launch Multi-node Training

Step 1: Prepare Data on All Nodes

Ensure training data is accessible on all nodes.
Option 1: Shared filesystem (NFS)
# Mount shared storage on all nodes
/mnt/shared/train_data.json
Option 2: Copy to each node
# On each node
scp master:/path/to/train_data.json /local/path/
Step 2: Configure Master Node (Rank 0)

On the master node:
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100  # Master node IP
export MASTER_PORT=6001

bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-7B-Chat \
  -d /mnt/shared/train_data.json
Step 3: Configure Worker Nodes (Rank 1, 2, ...)

On worker node 1:
export NNODES=2
export NODE_RANK=1  # Increment for each worker
export MASTER_ADDR=192.168.1.100  # Same as master
export MASTER_PORT=6001

bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-7B-Chat \
  -d /mnt/shared/train_data.json
Step 4: Monitor Training

Only the master node (rank 0) will print training logs:
[Rank 0] {'loss': 1.234, 'learning_rate': 0.0003, 'epoch': 0.1}
[Rank 0] {'loss': 0.987, 'learning_rate': 0.00029, 'epoch': 0.2}
Check all nodes are participating:
# On any node
nvidia-smi  # Should show GPU utilization ~90-100%

Multi-node Full-Parameter Training

Configuration

# On master node (rank 0)
export NNODES=4
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_ds.sh \
  -m Qwen/Qwen-14B \
  -d /shared/train_data.json
# On worker nodes (rank 1, 2, 3)
export NNODES=4
export NODE_RANK=1  # Change to 2, 3 for other workers
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_ds.sh \
  -m Qwen/Qwen-14B \
  -d /shared/train_data.json

ZeRO Stage Selection

Important: DeepSpeed ZeRO-3 requires high inter-node bandwidth and significantly reduces training speed in multi-node setups.
ZeRO-2 (Recommended for multi-node):
  • Shards optimizer states and gradients
  • Lower communication overhead
  • Better multi-node performance
  • Use: finetune/ds_config_zero2.json
ZeRO-3 (Only for high-bandwidth interconnect):
  • Shards model parameters, gradients, and optimizer states
  • Maximum memory efficiency
  • High communication requirements (InfiniBand recommended)
  • Use: finetune/ds_config_zero3.json
As the upstream Qwen README notes, DeepSpeed ZeRO 3 demands a much higher inter-node communication rate than ZeRO 2 and will significantly reduce training speed in multi-node fine-tuning, which is why ZeRO 3 configurations are not recommended for multi-node runs.
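One way to encode this recommendation in a launch wrapper is to pick the config by probing for InfiniBand hardware. The detection path (`/sys/class/infiniband`, a Linux convention) is an assumption about the host OS, not part of the repo's scripts:

```shell
# Heuristic config selection (illustrative): use ZeRO-3 only when
# InfiniBand devices are present; fall back to ZeRO-2 otherwise.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
  DS_CONFIG_PATH="finetune/ds_config_zero3.json"
else
  DS_CONFIG_PATH="finetune/ds_config_zero2.json"   # safe default for multi-node
fi
echo "Using DeepSpeed config: $DS_CONFIG_PATH"
```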

Multi-node Q-LoRA Training

# Master node
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_qlora_ds.sh \
  -m Qwen/Qwen-7B-Chat-Int4 \
  -d /shared/train_data.json
# Worker node
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_qlora_ds.sh \
  -m Qwen/Qwen-7B-Chat-Int4 \
  -d /shared/train_data.json
Q-LoRA with multi-node can train Qwen-72B on 4x single-GPU nodes (24GB each), whereas LoRA would require 80GB GPUs.

Performance Benchmarks

Qwen-7B LoRA Multi-node Performance

Setup: 2 nodes, 2x A100-80GB per node
| Configuration | Training Speed | Speedup | Effective Batch Size |
|---|---|---|---|
| 1 node, 2 GPUs | 1.5 s/iter | 1.0x | 32 |
| 2 nodes, 4 GPUs | 0.8 s/iter | 1.9x | 64 |

These figures correspond to the LoRA (multinode) results reported in the upstream README.

Scaling Efficiency

| Nodes | GPUs | Ideal Speedup | Actual Speedup | Efficiency |
|---|---|---|---|---|
| 1 | 2 | 1.0x | 1.0x | 100% |
| 2 | 4 | 2.0x | 1.9x | 95% |
| 4 | 8 | 4.0x | 3.5x | 88% |
| 8 | 16 | 8.0x | 6.4x | 80% |
Scaling efficiency decreases with more nodes due to communication overhead. Best efficiency with high-bandwidth interconnect.
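Efficiency here is simply actual speedup divided by ideal speedup. For example, for the 8-node row:

```shell
# Efficiency = actual speedup / ideal speedup, using the 8-node values above
awk 'BEGIN { printf "8-node efficiency: %.0f%%\n", 6.4 / 8.0 * 100 }'
# 8-node efficiency: 80%
```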

Troubleshooting

Symptoms: Workers hang at initialization and training never starts.
Debug steps:
  1. Verify master IP is correct:
# On master
hostname -I
  2. Test connectivity from workers:
# On worker
ping $MASTER_ADDR
telnet $MASTER_ADDR $MASTER_PORT
  3. Check firewall:
sudo ufw status
sudo ufw allow $MASTER_PORT/tcp
  4. Verify NODE_RANK is unique per node:
echo $NODE_RANK  # Should be 0, 1, 2, ... (unique)
Issue: Training is slower than expected.
Possible causes:
  1. Low network bandwidth:
# Test with iperf3
iperf3 -c worker-node-1
# Should show >10 Gbps for acceptable performance
  2. Using ZeRO-3 (not recommended for multi-node):
# Switch to ZeRO-2
--deepspeed finetune/ds_config_zero2.json
  3. Network congestion:
  • Ensure dedicated network for training
  • Disable other network-intensive tasks
  4. Small batch size:
# Increase to reduce communication overhead
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4
Symptoms: Nodes report different loss values; training is unstable.
Solutions:
  1. Ensure same data on all nodes:
# Verify data file checksum
md5sum /path/to/train_data.json
  2. Same random seed:
--seed 42
  3. Synchronized model loading:
  • Use shared checkpoint directory
  • Or ensure same model downloaded on all nodes
  4. Check NCCL errors in logs:
export NCCL_DEBUG=INFO
Issue: NCCL timeout or "NCCL communication failed" errors.
Solutions:
  1. Increase timeout:
export NCCL_TIMEOUT=1800  # 30 minutes
  2. Check network configuration:
# Verify NCCL can detect network
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0  # Enable InfiniBand
export NCCL_SOCKET_IFNAME=eth0  # Specify interface
  3. Use correct NCCL backend:
# For InfiniBand
export NCCL_IB_DISABLE=0

# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
Cause: Uneven data distribution or mismatched node configuration.
Solutions:
  1. Ensure same GPU types across nodes
  2. Verify same number of GPUs per node
  3. Balance data distribution:
# More dataloader workers can smooth out uneven input pipelines
--dataloader_num_workers 4
  4. Reduce batch size:
--per_device_train_batch_size 1

Best Practices

Network Optimization

Optimize inter-node communication:
# Enable NCCL optimizations
export NCCL_DEBUG=WARN  # Only show warnings (less overhead)
export NCCL_IB_DISABLE=0  # Enable InfiniBand
export NCCL_IB_GID_INDEX=3  # RoCE configuration
export NCCL_SOCKET_IFNAME=eth0  # Specify network interface

# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

Data Management

Shared filesystem (recommended):
  • Use NFS or distributed filesystem (e.g., CephFS)
  • Single source of truth for data and checkpoints
  • Automatic synchronization
Replicated data (alternative):
  • Copy data to local storage on each node
  • Faster I/O (no network overhead)
  • Must manually synchronize updates

Checkpoint Strategy

# Save checkpoints to shared storage
--output_dir /shared/checkpoints/qwen_experiment_1 \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3  # Keep only last 3 checkpoints (saves space)
Only the master node (rank 0) saves checkpoints. Worker nodes will wait during checkpoint saving.

Monitoring Multi-node Training

On master node:
# Watch training logs
tail -f output_qwen/training.log

# Monitor GPU usage
watch -n 1 nvidia-smi
On worker nodes:
# Verify GPU utilization (should be ~90-100%)
nvidia-smi

# Check for errors in logs
tail -f nohup.out

Advanced: Launching with Job Schedulers

SLURM

slurm_train.sh
#!/bin/bash
#SBATCH --job-name=qwen_multinode
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=6001
export NNODES=$SLURM_JOB_NUM_NODES
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE

# NODE_RANK must be resolved per task: SLURM_NODEID is 0 in the batch
# script itself and only takes the correct value inside each srun task
srun bash -c 'NODE_RANK=$SLURM_NODEID bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-14B-Chat \
  -d /shared/train_data.json'
Submit job:
sbatch slurm_train.sh

PBS/Torque

pbs_train.sh
#!/bin/bash
#PBS -N qwen_multinode
#PBS -l nodes=4:ppn=8:gpus=8
#PBS -l walltime=24:00:00

export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE)
export MASTER_PORT=6001
export NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)

# Launch the script on each node, passing a unique NODE_RANK through SSH
i=0
for NODE in $(sort -u $PBS_NODEFILE); do
  ssh $NODE "cd $(pwd) && NNODES=$NNODES NODE_RANK=$i MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT bash finetune/finetune_lora_ds.sh -m Qwen/Qwen-14B-Chat -d /shared/train_data.json" &
  i=$((i+1))
done

wait

Performance Tuning

Optimal Batch Size

Adjust for best throughput:
# Effective batch size formula
effective_batch_size = (
    per_device_train_batch_size × 
    gradient_accumulation_steps × 
    num_gpus × 
    num_nodes
)

# Example: 2 nodes, 4 GPUs per node
# per_device=2, grad_accum=4
# Effective = 2 × 4 × 4 × 2 = 64
Target effective batch size of 64-256 for LoRA fine-tuning. Adjust gradient accumulation to balance memory and throughput.
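The same formula can be evaluated directly in the shell as a quick check (values from the 2-node example above):

```shell
# 2 nodes, 4 GPUs per node, per-device batch 2, gradient accumulation 4
PER_DEVICE_BATCH=2
GRAD_ACCUM=4
GPUS_PER_NODE=4
NNODES=2

EFFECTIVE=$((PER_DEVICE_BATCH * GRAD_ACCUM * GPUS_PER_NODE * NNODES))
echo "effective batch size: $EFFECTIVE"   # 64, within the 64-256 target
```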

Communication Optimization

# Reduce communication frequency
--gradient_accumulation_steps 16  # More local computation

# Gradient compression, if desired, is configured through the DeepSpeed
# config (communication data types) rather than an NCCL environment variable

# Overlap communication with computation
# (automatic with DeepSpeed ZeRO)

Cost Optimization

Use Spot/Preemptible Instances

# Enable checkpoint resuming
--resume_from_checkpoint /shared/checkpoints/checkpoint-5000 \
--save_strategy "steps" \
--save_steps 500  # Frequent saves for fault tolerance

Mix Node Types

Not recommended: Mixing different GPU types can cause:
  • Load imbalance
  • Synchronization issues
  • Reduced performance
If necessary, group same GPU types together.

Next Steps

LoRA Fine-tuning

Understand LoRA for efficient multi-node training

Data Preparation

Prepare data for distributed training
