Multi-node distributed training enables you to scale fine-tuning across multiple machines, allowing you to train larger models faster or handle massive datasets that don’t fit on a single machine.
Overview
Multi-node training distributes the workload across multiple servers, each with multiple GPUs:
Scale to larger models: Train Qwen-72B or larger on multiple machines
Faster training: Reduce training time through data parallelism
Handle large datasets: Distribute data across nodes
Cost efficiency: Use available cluster resources effectively
Multi-node training works with all fine-tuning methods: full-parameter, LoRA, and Q-LoRA.
Architecture
Single Node vs Multi-node
Single Node (1 machine with 4 GPUs):
┌──────────────────────────┐
│ Node 0 (master) │
│ ┌──────────────────────┐ │
│ │ GPU0 GPU1 GPU2 GPU3 │ │
│ └──────────────────────┘ │
└──────────────────────────┘
Multi-node (2 machines with 4 GPUs each):
┌──────────────────────────┐ ┌──────────────────────────┐
│ Node 0 (master) │ │ Node 1 (worker) │
│ ┌──────────────────────┐ │ │ ┌──────────────────────┐ │
│ │ GPU0 GPU1 GPU2 GPU3 │ │↔↔↔│ │ GPU4 GPU5 GPU6 GPU7 │ │
│ └──────────────────────┘ │ │ └──────────────────────┘ │
└──────────────────────────┘ └──────────────────────────┘
High-speed interconnect (InfiniBand/RoCE)
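In the two-node diagram above, the GPU labels on node 1 (GPU4-GPU7) are global ranks: torchrun numbers each process as node rank times GPUs per node, plus the local GPU index. A quick sketch of that mapping:

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a GPU given its node rank and local index on that node."""
    return node_rank * gpus_per_node + local_rank

# Node 1, local GPU 2 in the 2x4-GPU layout above -> global rank 6 (labeled GPU6)
print(global_rank(1, 2, 4))  # -> 6
```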
Prerequisites
Hardware Requirements
Network: High-speed interconnect required
Recommended: InfiniBand (100+ Gbps) or RoCE
Minimum: 10GbE Ethernet (performance degradation expected)
GPUs: Same GPU types across all nodes for best results
Storage: Shared filesystem (NFS) or synchronized data on each node
Software Requirements
# Install on all nodes
pip install -r requirements.txt
pip install deepspeed
pip install "peft<0.8.0" # For LoRA/Q-LoRA
Network Configuration
Ensure SSH Connectivity
All nodes must be able to SSH to each other without a password:
# On master node
ssh-keygen -t rsa
ssh-copy-id user@worker-node-1
ssh-copy-id user@worker-node-2
# Test connectivity
ssh user@worker-node-1 "hostname"
ssh user@worker-node-2 "hostname"
Configure Firewall
Open required ports:
# Master port (default 6001)
sudo ufw allow 6001/tcp
# NCCL ports (for GPU communication)
sudo ufw allow 1024:65535/tcp
Verify Network Speed
Test inter-node bandwidth:
# Install iperf3
sudo apt-get install iperf3
# On worker node
iperf3 -s
# On master node
iperf3 -c worker-node-1
# Should show: 10+ Gbps for acceptable performance
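To see why bandwidth matters so much, you can roughly estimate the per-step gradient all-reduce time from model size and link speed. This is a back-of-the-envelope sketch (ring all-reduce, bf16 gradients); real NCCL performance depends on topology and overlap:

```python
def allreduce_seconds(num_params: float, num_ranks: int, bandwidth_gbps: float,
                      bytes_per_grad: int = 2) -> float:
    """Rough ring all-reduce time: each rank moves ~2*(n-1)/n of the gradient bytes."""
    payload = num_params * bytes_per_grad              # gradient bytes (bf16 = 2 bytes)
    volume = 2 * (num_ranks - 1) / num_ranks * payload
    return volume / (bandwidth_gbps * 1e9 / 8)         # Gbps -> bytes/s

# 7B parameters, 8 ranks: 10 Gbps Ethernet vs 100 Gbps InfiniBand
print(f"{allreduce_seconds(7e9, 8, 10):.1f}s")   # -> 19.6s per full gradient sync
print(f"{allreduce_seconds(7e9, 8, 100):.1f}s")  # -> 2.0s per full gradient sync
```

Gradient accumulation and communication/computation overlap hide part of this cost, but the tenfold gap between the two links is why 10GbE is listed as a degraded minimum.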
Multi-node Configuration
Environment Variables
Each node needs these environment variables set:
NNODES: Total number of nodes (machines) in the training cluster.
NODE_RANK: Rank of the current node. The master node is 0; workers are 1, 2, 3, …
MASTER_ADDR: IP address or hostname of the master node (rank 0).
MASTER_PORT: Port for communication between nodes (default: 6001).
GPUS_PER_NODE: Number of GPUs on each node. Must be the same across all nodes.
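A small sanity check you can run on each node before launching helps catch mis-set variables early. This is a hypothetical helper, not part of the training scripts:

```python
import os

def check_node_env(env=os.environ) -> dict:
    """Validate the multi-node environment variables; raises if inconsistent."""
    cfg = {
        "NNODES": int(env["NNODES"]),
        "NODE_RANK": int(env["NODE_RANK"]),
        "MASTER_ADDR": env["MASTER_ADDR"],
        "MASTER_PORT": int(env.get("MASTER_PORT", "6001")),
    }
    assert 0 <= cfg["NODE_RANK"] < cfg["NNODES"], "NODE_RANK must be in [0, NNODES)"
    return cfg

# Example with the worker-node settings used later on this page
print(check_node_env({"NNODES": "2", "NODE_RANK": "1", "MASTER_ADDR": "192.168.1.100"}))
```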
Multi-node LoRA Training
Training Script
The finetune_lora_ds.sh script supports multi-node training:
finetune/finetune_lora_ds.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Multi-node configuration
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}                  # Total number of nodes
NODE_RANK=${NODE_RANK:-0}            # This node's rank (0, 1, 2, ...)
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}
MODEL="Qwen/Qwen-7B-Chat"
DATA="path_to_data.json"
DS_CONFIG_PATH="finetune/ds_config_zero2.json"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--model_max_length 512 \
--save_strategy "steps" \
--save_steps 1000 \
--lazy_preprocess True \
--use_lora \
--gradient_checkpointing \
--deepspeed ${DS_CONFIG_PATH}
Launch Multi-node Training
Prepare Data on All Nodes
Ensure training data is accessible on all nodes:
Option 1: Shared filesystem (NFS)
# Mount shared storage on all nodes
/mnt/shared/train_data.json
Option 2: Copy to each node
# On each node
scp master:/path/to/train_data.json /local/path/
Configure Master Node (Rank 0)
On the master node:
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100  # Master node IP
export MASTER_PORT=6001
bash finetune/finetune_lora_ds.sh \
-m Qwen/Qwen-7B-Chat \
-d /mnt/shared/train_data.json
Configure Worker Nodes (Rank 1, 2, ...)
On worker node 1:
export NNODES=2
export NODE_RANK=1                # Increment for each worker
export MASTER_ADDR=192.168.1.100  # Same as master
export MASTER_PORT=6001
bash finetune/finetune_lora_ds.sh \
-m Qwen/Qwen-7B-Chat \
-d /mnt/shared/train_data.json
Monitor Training
Only the master node (rank 0) will print training logs:
[Rank 0] {'loss': 1.234, 'learning_rate': 0.0003, 'epoch': 0.1}
[Rank 0] {'loss': 0.987, 'learning_rate': 0.00029, 'epoch': 0.2}
Check that all nodes are participating:
# On any node
nvidia-smi # Should show GPU utilization ~90-100%
Multi-node Full-Parameter Training
Configuration
# On master node (rank 0)
export NNODES=4
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
bash finetune/finetune_ds.sh \
-m Qwen/Qwen-14B \
-d /shared/train_data.json
# On worker nodes (rank 1, 2, 3)
export NNODES=4
export NODE_RANK=1  # Change to 2, 3 for other workers
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
bash finetune/finetune_ds.sh \
-m Qwen/Qwen-14B \
-d /shared/train_data.json
ZeRO Stage Selection
Important : DeepSpeed ZeRO-3 requires high inter-node bandwidth and significantly reduces training speed in multi-node setups.
ZeRO-2 (Recommended for multi-node):
Shards optimizer states and gradients
Lower communication overhead
Better multi-node performance
Use: finetune/ds_config_zero2.json
ZeRO-3 (Only for high-bandwidth interconnects):
Shards model parameters, gradients, and optimizer states
Maximum memory efficiency
High communication requirements (InfiniBand recommended)
Use: finetune/ds_config_zero3.json
# Note (from the upstream README): DeepSpeed ZeRO-3 requires a much higher inter-node
# communication rate than ZeRO-2, which significantly reduces training speed in
# multi-node fine-tuning. ZeRO-3 configurations are therefore not recommended
# for multi-node fine-tuning scripts.
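For orientation, the part of the DeepSpeed config that selects the ZeRO stage is the `zero_optimization` block. The sketch below shows illustrative values only; the repository's `finetune/ds_config_zero2.json` is the authoritative file:

```python
import json

# Minimal illustrative ZeRO-2 config; field values are assumptions, not the repo file.
ds_config_zero2 = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer states and gradients (not parameters)
        "overlap_comm": True,       # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
print(json.dumps(ds_config_zero2, indent=2))
```

Switching to ZeRO-3 amounts to setting `"stage": 3` (plus its extra parameter-sharding options), which is what drives the higher inter-node traffic described above.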
Multi-node Q-LoRA Training
# Master node
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
bash finetune/finetune_qlora_ds.sh \
-m Qwen/Qwen-7B-Chat-Int4 \
-d /shared/train_data.json
# Worker node
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
bash finetune/finetune_qlora_ds.sh \
-m Qwen/Qwen-7B-Chat-Int4 \
-d /shared/train_data.json
Q-LoRA with multi-node can train Qwen-72B on 4x single-GPU nodes (24GB each), whereas LoRA would require 80GB GPUs.
Setup: 2 nodes, 2x A100-80GB per node

Configuration     Training Speed   Speedup   Effective Batch Size
1 node, 2 GPUs    1.5 s/iter       1.0x      32
2 nodes, 4 GPUs   0.8 s/iter       1.9x      64
These figures correspond to the LoRA (multi-node) configuration reported in the upstream README (lines 798-799).
Scaling Efficiency
Nodes   GPUs   Ideal Speedup   Actual Speedup   Efficiency
1       2      1.0x            1.0x             100%
2       4      2.0x            1.9x             95%
4       8      4.0x            3.5x             88%
8       16     8.0x            6.4x             80%
Scaling efficiency decreases with more nodes due to communication overhead. Best efficiency with high-bandwidth interconnect.
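Efficiency here is simply the actual speedup divided by the ideal (linear) speedup. As a quick check against the numbers above:

```python
def scaling_efficiency(actual_speedup: float, ideal_speedup: float) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return actual_speedup / ideal_speedup

# (nodes, ideal, actual) rows from the scaling table
for nodes, ideal, actual in [(2, 2.0, 1.9), (4, 4.0, 3.5), (8, 8.0, 6.4)]:
    print(f"{nodes} nodes: {scaling_efficiency(actual, ideal):.0%}")
```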
Troubleshooting
Workers Not Connecting to Master
Symptoms: Workers hang at initialization; no training starts.
Debug steps:
Verify master IP is correct:
Test connectivity from workers:
# On worker
ping $MASTER_ADDR
telnet $MASTER_ADDR $MASTER_PORT
Check firewall:
sudo ufw status
sudo ufw allow ${MASTER_PORT}/tcp
Verify NODE_RANK is unique per node:
echo $NODE_RANK # Should be 0, 1, 2, ... (unique)
Slow Training Speed
Possible causes:
Low network bandwidth :
# Test with iperf3
iperf3 -c worker-node-1
# Should show >10 Gbps for acceptable performance
Using ZeRO-3 (not recommended for multi-node) :
# Switch to ZeRO-2
--deepspeed finetune/ds_config_zero2.json
Network congestion :
Ensure dedicated network for training
Disable other network-intensive tasks
Small batch size :
# Increase to reduce communication overhead
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4
Nodes Out of Sync / Diverging Loss
Symptoms: Different loss values reported across nodes; training unstable.
Solutions:
Ensure same data on all nodes:
# Verify data file checksum
md5sum /path/to/train_data.json
Same random seed:
Synchronized model loading:
Use shared checkpoint directory
Or ensure same model downloaded on all nodes
Check NCCL errors in logs:
NCCL Timeout or Communication Failures
Issue: "NCCL timeout" or "NCCL communication failed" errors in the logs.
Solutions:
Increase timeout:
export NCCL_TIMEOUT=1800  # 30 minutes
Check network configuration:
# Verify NCCL can detect network
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0        # Enable InfiniBand
export NCCL_SOCKET_IFNAME=eth0  # Specify interface
Use correct NCCL backend:
# For InfiniBand
export NCCL_IB_DISABLE=0
# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
Out of Memory on Some Nodes
Cause: Uneven data distribution or inconsistent node configuration.
Solutions:
Ensure same GPU types across nodes
Verify same number of GPUs per node
Balance data distribution:
# Check data shard sizes
--dataloader_num_workers 4
Reduce batch size:
--per_device_train_batch_size 1
Best Practices
Network Optimization
Optimize inter-node communication:
# Enable NCCL optimizations
export NCCL_DEBUG=WARN          # Only show warnings (less overhead)
export NCCL_IB_DISABLE=0        # Enable InfiniBand
export NCCL_IB_GID_INDEX=3      # RoCE configuration
export NCCL_SOCKET_IFNAME=eth0  # Specify network interface
# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
Data Management
Shared filesystem (recommended):
Use NFS or distributed filesystem (e.g., CephFS)
Single source of truth for data and checkpoints
Automatic synchronization
Replicated data (alternative):
Copy data to local storage on each node
Faster I/O (no network overhead)
Must manually synchronize updates
Checkpoint Strategy
# Save checkpoints to shared storage
--output_dir /shared/checkpoints/qwen_experiment_1 \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3 # Keep only last 3 checkpoints (saves space)
Only the master node (rank 0) saves checkpoints. Worker nodes will wait during checkpoint saving.
Monitoring Multi-node Training
On master node:
# Watch training logs
tail -f output_qwen/training.log
# Monitor GPU usage
watch -n 1 nvidia-smi
On worker nodes:
# Verify GPU utilization (should be ~90-100%)
nvidia-smi
# Check for errors in logs
tail -f nohup.out
Advanced: Launching with Job Schedulers
SLURM
#!/bin/bash
#SBATCH --job-name=qwen_multinode
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00
export MASTER_ADDR=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
export MASTER_PORT=6001
export NNODES=$SLURM_JOB_NUM_NODES
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
# NODE_RANK must be resolved inside each srun task, not at submit time,
# so each node picks up its own SLURM_NODEID.
srun bash -c 'NODE_RANK=$SLURM_NODEID bash finetune/finetune_lora_ds.sh \
    -m Qwen/Qwen-14B-Chat \
    -d /shared/train_data.json'
Submit the job with sbatch.
PBS/Torque
#!/bin/bash
#PBS -N qwen_multinode
#PBS -l nodes=4:ppn=8:gpus=8
#PBS -l walltime=24:00:00
export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE)
export MASTER_PORT=6001
export NNODES=$(sort -u $PBS_NODEFILE | wc -l)
# Launch on each node, passing its rank through the ssh command
# (plain `export NODE_RANK` would not propagate across ssh)
i=0
for NODE in $(sort -u $PBS_NODEFILE); do
ssh $NODE "cd $(pwd) && NNODES=$NNODES NODE_RANK=$i MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT bash finetune/finetune_lora_ds.sh -m Qwen/Qwen-14B-Chat -d /shared/train_data.json" &
i=$((i+1))
done
wait
Optimal Batch Size
Adjust for best throughput:
# Effective batch size formula
effective_batch_size = (
per_device_train_batch_size ×
gradient_accumulation_steps ×
num_gpus ×
num_nodes
)
# Example: 2 nodes, 4 GPUs per node
# per_device=2, grad_accum=4
# Effective = 2 × 4 × 4 × 2 = 64
Target effective batch size of 64-256 for LoRA fine-tuning. Adjust gradient accumulation to balance memory and throughput.
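The formula above as a runnable helper, matching the worked example:

```python
def effective_batch_size(per_device: int, grad_accum: int,
                         gpus_per_node: int, num_nodes: int) -> int:
    """Global batch size seen by the optimizer per step."""
    return per_device * grad_accum * gpus_per_node * num_nodes

# 2 nodes, 4 GPUs per node, per_device=2, grad_accum=4
print(effective_batch_size(2, 4, 4, 2))  # -> 64
```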
Communication Optimization
# Reduce communication frequency
--gradient_accumulation_steps 16  # More local computation per sync
# Communication/computation overlap is handled automatically by DeepSpeed ZeRO.
Cost Optimization
Use Spot/Preemptible Instances
# Enable checkpoint resuming
--resume_from_checkpoint /shared/checkpoints/checkpoint-5000 \
--save_strategy "steps" \
--save_steps 500 # Frequent saves for fault tolerance
Mix Node Types
Not recommended: Mixing different GPU types can cause:
Load imbalance
Synchronization issues
Reduced performance
If necessary, group same GPU types together.
Next Steps
LoRA Fine-tuning Understand LoRA for efficient multi-node training
Data Preparation Prepare data for distributed training