Multi-node distributed training scales fine-tuning across multiple machines, letting you train larger models, cut training time, and handle massive datasets that don’t fit on a single machine.

Overview

Multi-node training distributes the workload across multiple servers, each with multiple GPUs:
  • Scale to larger models: Train Qwen-72B or larger on multiple machines
  • Faster training: Reduce training time through data parallelism
  • Handle large datasets: Distribute data across nodes
  • Cost efficiency: Use available cluster resources effectively
Multi-node training works with all fine-tuning methods: full-parameter, LoRA, and Q-LoRA.

Architecture

Single Node vs Multi-node

Single Node (1 machine with 4 GPUs):
┌────────────────────────────┐
│  Node 0 (master)           │
│  ┌──────────────────────┐  │
│  │ GPU0 GPU1 GPU2 GPU3  │  │
│  └──────────────────────┘  │
└────────────────────────────┘

Multi-node (2 machines with 4 GPUs each):
┌────────────────────────────┐     ┌────────────────────────────┐
│  Node 0 (master)           │     │  Node 1 (worker)           │
│  ┌──────────────────────┐  │     │  ┌──────────────────────┐  │
│  │ GPU0 GPU1 GPU2 GPU3  │  │ ↔↔↔ │  │ GPU4 GPU5 GPU6 GPU7  │  │
│  └──────────────────────┘  │     │  └──────────────────────┘  │
└────────────────────────────┘     └────────────────────────────┘
              High-speed interconnect (InfiniBand/RoCE)
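Each GPU process ends up with a local rank on its node and a unique global rank across the cluster; torchrun derives both from NNODES, NODE_RANK, and the per-node GPU count. The arithmetic, sketched for the 2-node diagram (illustrative, not part of the training scripts):

```shell
# Rank arithmetic for a 2-node, 4-GPU-per-node cluster
NNODES=2
GPUS_PER_NODE=4
NODE_RANK=1     # second machine
LOCAL_RANK=2    # third GPU on that machine; torchrun sets this per process

WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
GLOBAL_RANK=$((NODE_RANK * GPUS_PER_NODE + LOCAL_RANK))
echo "world_size=$WORLD_SIZE global_rank=$GLOBAL_RANK"  # GPU6 in the diagram
```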

Prerequisites

Hardware Requirements

  • Network: High-speed interconnect required
    • Recommended: InfiniBand (100+ Gbps) or RoCE
    • Minimum: 10GbE Ethernet (performance degradation expected)
  • GPUs: Same GPU types across all nodes for best results
  • Storage: Shared filesystem (NFS) or synchronized data on each node

Software Requirements

# Install on all nodes
pip install -r requirements.txt
pip install deepspeed
pip install "peft<0.8.0"  # For LoRA/Q-LoRA

Network Configuration

Step 1: Ensure SSH Connectivity

All nodes must be able to SSH to each other without a password:
# On master node
ssh-keygen -t rsa
ssh-copy-id user@worker-node-1
ssh-copy-id user@worker-node-2

# Test connectivity
ssh user@worker-node-1 "hostname"
ssh user@worker-node-2 "hostname"
Step 2: Configure Firewall

Open required ports:
# Master port (default 6001)
sudo ufw allow 6001/tcp

# NCCL ports (for GPU communication)
sudo ufw allow 1024:65535/tcp
Step 3: Verify Network Speed

Test inter-node bandwidth:
# Install iperf3
sudo apt-get install iperf3

# On worker node
iperf3 -s

# On master node
iperf3 -c worker-node-1
# Should show: 10+ Gbps for acceptable performance
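To see why bandwidth matters, a rough back-of-the-envelope: a full bf16 gradient for a 7B-parameter model is about 14 GB, and moving it over a 10 Gbps link takes an order of magnitude longer than over 100 Gbps. Real all-reduce traffic depends on the algorithm and ZeRO stage, so treat this strictly as an illustration:

```shell
# Back-of-envelope: naive transfer time for one full bf16 gradient (7B params)
# Actual all-reduce traffic differs; this only illustrates why bandwidth matters.
awk 'BEGIN {
  bytes = 7e9 * 2               # 7B parameters x 2 bytes (bf16)
  for (gbps = 10; gbps <= 100; gbps *= 10)
    printf "%3d Gbps: %.1f s per naive transfer\n", gbps, bytes * 8 / (gbps * 1e9)
}'
# ~11 s at 10 Gbps vs ~1 s at 100 Gbps
```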

Multi-node Configuration

Environment Variables

Each node needs these environment variables set:
| Variable | Type | Required | Description |
|---|---|---|---|
| NNODES | int | yes | Total number of nodes (machines) in the training cluster. |
| NODE_RANK | int | yes | Rank of the current node. The master node is 0; workers are 1, 2, 3, … |
| MASTER_ADDR | string | yes | IP address or hostname of the master node (rank 0). |
| MASTER_PORT | int | yes | Port for communication between nodes (default: 6001). |
| GPUS_PER_NODE | int | yes | Number of GPUs on each node. Must be the same across all nodes. |
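Before launching, a small wrapper can sanity-check these variables. A sketch with example values for a 2-node, 4-GPU-per-node cluster (the IP is a placeholder):

```shell
# Example values (illustrative) for a 2-node, 4-GPU-per-node cluster
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001
export GPUS_PER_NODE=4

# Fail fast on obvious misconfiguration before torchrun starts
if [ "$NODE_RANK" -ge "$NNODES" ]; then
  echo "NODE_RANK ($NODE_RANK) must be less than NNODES ($NNODES)" >&2
  exit 1
fi
echo "This node contributes $GPUS_PER_NODE of $((NNODES * GPUS_PER_NODE)) total GPUs"
```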

Multi-node LoRA Training

Training Script

The finetune_lora_ds.sh script supports multi-node training:
finetune/finetune_lora_ds.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

# Multi-node configuration
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}           # Total number of nodes
NODE_RANK=${NODE_RANK:-0}     # This node's rank (0, 1, 2, ...)
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B-Chat"
DATA="path_to_data.json"
DS_CONFIG_PATH="finetune/ds_config_zero2.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 3e-4 \
    --weight_decay 0.1 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --model_max_length 512 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --lazy_preprocess True \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed ${DS_CONFIG_PATH}

Launch Multi-node Training

Step 1: Prepare Data on All Nodes

Ensure training data is accessible on all nodes.
Option 1: Shared filesystem (NFS)
# Mount shared storage on all nodes
/mnt/shared/train_data.json
Option 2: Copy to each node
# On each node
scp master:/path/to/train_data.json /local/path/
Step 2: Configure Master Node (Rank 0)

On the master node:
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100  # Master node IP
export MASTER_PORT=6001

bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-7B-Chat \
  -d /mnt/shared/train_data.json
Step 3: Configure Worker Nodes (Rank 1, 2, ...)

On worker node 1:
export NNODES=2
export NODE_RANK=1  # Increment for each worker
export MASTER_ADDR=192.168.1.100  # Same as master
export MASTER_PORT=6001

bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-7B-Chat \
  -d /mnt/shared/train_data.json
Step 4: Monitor Training

Only the master node (rank 0) will print training logs:
[Rank 0] {'loss': 1.234, 'learning_rate': 0.0003, 'epoch': 0.1}
[Rank 0] {'loss': 0.987, 'learning_rate': 0.00029, 'epoch': 0.2}
Check all nodes are participating:
# On any node
nvidia-smi  # Should show GPU utilization ~90-100%

Multi-node Full-Parameter Training

Configuration

# On master node (rank 0)
export NNODES=4
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_ds.sh \
  -m Qwen/Qwen-14B \
  -d /shared/train_data.json
# On worker nodes (rank 1, 2, 3)
export NNODES=4
export NODE_RANK=1  # Change to 2, 3 for other workers
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_ds.sh \
  -m Qwen/Qwen-14B \
  -d /shared/train_data.json

ZeRO Stage Selection

Important: DeepSpeed ZeRO-3 requires high inter-node bandwidth and significantly reduces training speed in multi-node setups.
ZeRO-2 (Recommended for multi-node):
  • Shards optimizer states and gradients
  • Lower communication overhead
  • Better multi-node performance
  • Use: finetune/ds_config_zero2.json
ZeRO-3 (Only for high-bandwidth interconnect):
  • Shards model parameters, gradients, and optimizer states
  • Maximum memory efficiency
  • High communication requirements (InfiniBand recommended)
  • Use: finetune/ds_config_zero3.json
As the upstream Qwen README notes, DeepSpeed ZeRO 3 demands a much higher inter-node communication rate than ZeRO 2 and will significantly reduce training speed in multi-node fine-tuning, which is why ZeRO 3 configurations are not recommended for multi-node runs.
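One way to encode this recommendation in a launch wrapper is to pick the config by probing for InfiniBand hardware. The detection path (`/sys/class/infiniband`, a Linux convention) is an assumption about the host OS, not part of the repo's scripts:

```shell
# Heuristic config selection (illustrative): use ZeRO-3 only when
# InfiniBand devices are present; fall back to ZeRO-2 otherwise.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
  DS_CONFIG_PATH="finetune/ds_config_zero3.json"
else
  DS_CONFIG_PATH="finetune/ds_config_zero2.json"   # safe default for multi-node
fi
echo "Using DeepSpeed config: $DS_CONFIG_PATH"
```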

Multi-node Q-LoRA Training

# Master node
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_qlora_ds.sh \
  -m Qwen/Qwen-7B-Chat-Int4 \
  -d /shared/train_data.json
# Worker node
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=6001

bash finetune/finetune_qlora_ds.sh \
  -m Qwen/Qwen-7B-Chat-Int4 \
  -d /shared/train_data.json
Q-LoRA with multi-node can train Qwen-72B on 4x single-GPU nodes (24GB each), whereas LoRA would require 80GB GPUs.

Performance Benchmarks

Qwen-7B LoRA Multi-node Performance

Setup: 2 nodes, 2x A100-80GB per node
| Configuration | Training Speed | Speedup | Effective Batch Size |
|---|---|---|---|
| 1 node, 2 GPUs | 1.5 s/iter | 1.0x | 32 |
| 2 nodes, 4 GPUs | 0.8 s/iter | 1.9x | 64 |

These figures correspond to the LoRA (multinode) results reported in the upstream README.

Scaling Efficiency

| Nodes | GPUs | Ideal Speedup | Actual Speedup | Efficiency |
|---|---|---|---|---|
| 1 | 2 | 1.0x | 1.0x | 100% |
| 2 | 4 | 2.0x | 1.9x | 95% |
| 4 | 8 | 4.0x | 3.5x | 88% |
| 8 | 16 | 8.0x | 6.4x | 80% |
Scaling efficiency decreases with more nodes due to communication overhead. Best efficiency with high-bandwidth interconnect.
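Efficiency here is simply actual speedup divided by ideal speedup. For example, for the 8-node row:

```shell
# Efficiency = actual speedup / ideal speedup, using the 8-node values above
awk 'BEGIN { printf "8-node efficiency: %.0f%%\n", 6.4 / 8.0 * 100 }'
# 8-node efficiency: 80%
```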

Troubleshooting

Symptoms: Workers hang at initialization and training never starts.
Debug steps:
  1. Verify master IP is correct:
# On master
hostname -I
  2. Test connectivity from workers:
# On worker
ping $MASTER_ADDR
telnet $MASTER_ADDR $MASTER_PORT
  3. Check firewall:
sudo ufw status
sudo ufw allow $MASTER_PORT/tcp
  4. Verify NODE_RANK is unique per node:
echo $NODE_RANK  # Should be 0, 1, 2, ... (unique)
Issue: Training is slower than expected.
Possible causes:
  1. Low network bandwidth:
# Test with iperf3
iperf3 -c worker-node-1
# Should show >10 Gbps for acceptable performance
  2. Using ZeRO-3 (not recommended for multi-node):
# Switch to ZeRO-2
--deepspeed finetune/ds_config_zero2.json
  3. Network congestion:
  • Ensure dedicated network for training
  • Disable other network-intensive tasks
  4. Small batch size:
# Increase to reduce communication overhead
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4
Symptoms: Nodes report different loss values; training is unstable.
Solutions:
  1. Ensure same data on all nodes:
# Verify data file checksum
md5sum /path/to/train_data.json
  2. Same random seed:
--seed 42
  3. Synchronized model loading:
  • Use shared checkpoint directory
  • Or ensure same model downloaded on all nodes
  4. Check NCCL errors in logs:
export NCCL_DEBUG=INFO
Issue: NCCL timeout or "NCCL communication failed" errors.
Solutions:
  1. Increase timeout:
export NCCL_TIMEOUT=1800  # 30 minutes
  2. Check network configuration:
# Verify NCCL can detect network
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0  # Enable InfiniBand
export NCCL_SOCKET_IFNAME=eth0  # Specify interface
  3. Use correct NCCL backend:
# For InfiniBand
export NCCL_IB_DISABLE=0

# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
Cause: Uneven data distribution or mismatched node configuration.
Solutions:
  1. Ensure same GPU types across nodes
  2. Verify same number of GPUs per node
  3. Balance data distribution:
# More dataloader workers can smooth out uneven input pipelines
--dataloader_num_workers 4
  4. Reduce batch size:
--per_device_train_batch_size 1

Best Practices

Network Optimization

Optimize inter-node communication:
# Enable NCCL optimizations
export NCCL_DEBUG=WARN  # Only show warnings (less overhead)
export NCCL_IB_DISABLE=0  # Enable InfiniBand
export NCCL_IB_GID_INDEX=3  # RoCE configuration
export NCCL_SOCKET_IFNAME=eth0  # Specify network interface

# For Ethernet
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

Data Management

Shared filesystem (recommended):
  • Use NFS or distributed filesystem (e.g., CephFS)
  • Single source of truth for data and checkpoints
  • Automatic synchronization
Replicated data (alternative):
  • Copy data to local storage on each node
  • Faster I/O (no network overhead)
  • Must manually synchronize updates

Checkpoint Strategy

# Save checkpoints to shared storage
--output_dir /shared/checkpoints/qwen_experiment_1 \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3  # Keep only last 3 checkpoints (saves space)
Only the master node (rank 0) saves checkpoints. Worker nodes will wait during checkpoint saving.

Monitoring Multi-node Training

On master node:
# Watch training logs
tail -f output_qwen/training.log

# Monitor GPU usage
watch -n 1 nvidia-smi
On worker nodes:
# Verify GPU utilization (should be ~90-100%)
nvidia-smi

# Check for errors in logs
tail -f nohup.out

Advanced: Launching with Job Schedulers

SLURM

slurm_train.sh
#!/bin/bash
#SBATCH --job-name=qwen_multinode
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=6001
export NNODES=$SLURM_JOB_NUM_NODES
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE

# NODE_RANK must be resolved per task: SLURM_NODEID is 0 in the batch
# script itself and only takes the correct value inside each srun task
srun bash -c 'NODE_RANK=$SLURM_NODEID bash finetune/finetune_lora_ds.sh \
  -m Qwen/Qwen-14B-Chat \
  -d /shared/train_data.json'
Submit job:
sbatch slurm_train.sh

PBS/Torque

pbs_train.sh
#!/bin/bash
#PBS -N qwen_multinode
#PBS -l nodes=4:ppn=8:gpus=8
#PBS -l walltime=24:00:00

export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE)
export MASTER_PORT=6001
export NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)

# Launch the script on each node, passing a unique NODE_RANK through SSH
i=0
for NODE in $(sort -u $PBS_NODEFILE); do
  ssh $NODE "cd $(pwd) && NNODES=$NNODES NODE_RANK=$i MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT bash finetune/finetune_lora_ds.sh -m Qwen/Qwen-14B-Chat -d /shared/train_data.json" &
  i=$((i+1))
done

wait

Performance Tuning

Optimal Batch Size

Adjust for best throughput:
# Effective batch size formula
effective_batch_size = (
    per_device_train_batch_size × 
    gradient_accumulation_steps × 
    num_gpus × 
    num_nodes
)

# Example: 2 nodes, 4 GPUs per node
# per_device=2, grad_accum=4
# Effective = 2 × 4 × 4 × 2 = 64
Target effective batch size of 64-256 for LoRA fine-tuning. Adjust gradient accumulation to balance memory and throughput.
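The same formula can be evaluated directly in the shell as a quick check (values from the 2-node example above):

```shell
# 2 nodes, 4 GPUs per node, per-device batch 2, gradient accumulation 4
PER_DEVICE_BATCH=2
GRAD_ACCUM=4
GPUS_PER_NODE=4
NNODES=2

EFFECTIVE=$((PER_DEVICE_BATCH * GRAD_ACCUM * GPUS_PER_NODE * NNODES))
echo "effective batch size: $EFFECTIVE"   # 64, within the 64-256 target
```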

Communication Optimization

# Reduce communication frequency
--gradient_accumulation_steps 16  # More local computation

# Gradient compression, if desired, is configured through the DeepSpeed
# config (communication data types) rather than an NCCL environment variable

# Overlap communication with computation
# (automatic with DeepSpeed ZeRO)

Cost Optimization

Use Spot/Preemptible Instances

# Enable checkpoint resuming
--resume_from_checkpoint /shared/checkpoints/checkpoint-5000 \
--save_strategy "steps" \
--save_steps 500  # Frequent saves for fault tolerance

Mix Node Types

Not recommended: Mixing different GPU types can cause:
  • Load imbalance
  • Synchronization issues
  • Reduced performance
If necessary, group same GPU types together.

Next Steps

LoRA Fine-tuning

Understand LoRA for efficient multi-node training

Data Preparation

Prepare data for distributed training
