Parallelism Types
Tensor Parallel (TP)
Shards model weights across GPUs. Best for small batch sizes and memory-constrained scenarios.
Pipeline Parallel (PP)
Distributes model layers across GPUs. Best for large models that don’t fit in single GPU memory.
Data Parallel (DP)
Replicates model across GPUs for different requests. Best for large batch sizes and high throughput.
Expert Parallel (EP)
Distributes experts across GPUs for MoE models. Best for models with high expert count.
Context Parallel (CP)
Distributes context processing across GPUs. Best for long context scenarios.
Wide Expert Parallel
Advanced EP with load balancing for large-scale MoE models. Best for DeepSeek-V3/R1, LLaMA4, Qwen3.
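The trade-off behind tensor parallelism can be made concrete with a toy sketch: a weight matrix is split column-wise across ranks, each rank computes a partial product, and the shards are concatenated (the role played by an all-gather in real TP kernels). All names here are illustrative, not TensorRT-LLM internals.

```python
# Toy tensor-parallel matmul: shard a weight matrix column-wise across
# "ranks" and recombine the partial outputs (illustrative only).

def shard_columns(weight, tp_size):
    """Split a [rows x cols] matrix into tp_size column shards."""
    cols = len(weight[0])
    assert cols % tp_size == 0, "columns must divide evenly across ranks"
    width = cols // tp_size
    return [[row[r * width:(r + 1) * width] for row in weight]
            for r in range(tp_size)]

def matmul(x, w):
    """Plain matrix multiply: [n x k] @ [k x m] -> [n x m]."""
    return [[sum(xi[t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

def tp_matmul(x, weight, tp_size):
    """Each rank multiplies against its own shard; the concatenation
    here stands in for the all-gather a real TP implementation does."""
    outs = [matmul(x, shard) for shard in shard_columns(weight, tp_size)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
assert tp_matmul(x, w, tp_size=2) == matmul(x, w)  # sharded == unsharded
```

Each rank holds only `1/tp_size` of the weights, which is why TP helps memory-constrained deployments, at the cost of a collective communication per layer.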
Attention Module Parallelism
Tensor Parallelism for Attention
Data Parallelism for Attention
FFN Module Parallelism
Dense Models
For dense (non-MoE) models, tensor parallelism is supported.
Mixture of Experts (MoE)
MoE models replace a single FFN with multiple experts. TensorRT-LLM supports three execution patterns:
- TP (Tensor Parallel)
- EP (Expert Parallel)
- Hybrid ETP
Under pure TP:
- Every expert’s weight matrix is sliced across all GPUs
- Each GPU sees all tokens
- Higher communication overhead
- Better load balancing
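The difference between the patterns shows up in where tokens travel. A toy sketch (illustrative only, not the real top-k gating or all-to-all kernels): under EP each token is sent only to the GPU owning its selected expert, so a hot expert concentrates traffic on one GPU, whereas under TP every GPU processes every token.

```python
# Toy comparison of token-to-GPU assignment under EP for an MoE layer
# (illustrative; real routing uses top-k gating and all-to-all kernels).
from collections import Counter

def ep_token_counts(token_experts, experts_per_gpu):
    """EP: each expert lives wholly on one GPU; every token is sent to
    the GPU that owns its selected expert."""
    return Counter(e // experts_per_gpu for e in token_experts)

# 8 experts over 4 GPUs (2 experts per GPU); expert 0 is "hot".
token_experts = [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
counts = ep_token_counts(token_experts, experts_per_gpu=2)
print(dict(counts))  # GPU 0 receives 6 of 12 tokens -> load imbalance
# Under pure TP, all 4 GPUs would each process all 12 tokens: balanced,
# but every expert's weights are sliced and outputs reduced across GPUs.
```

This imbalance under plain EP is exactly what the Wide-EP load balancing described below targets.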
Wide Expert Parallelism (Wide-EP)
Wide-EP is TensorRT-LLM’s advanced solution for large-scale MoE model inference, addressing workload imbalance through intelligent load balancing.
Motivation
Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 introduce challenges:
- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution
- Communication overhead in distributed expert parallelism
- Hot expert problem where certain experts receive significantly more tokens
Key Features
Expert Replication and Load Balancing
Wide-EP introduces expert slots decoupled from specific experts:
- Multiple replicas of hot experts across different GPUs
- Dynamic expert placement based on workload patterns
- Both offline and online load balancing strategies
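The expert-slot decoupling can be sketched in a few lines: slots are physical placements, a hot expert may occupy several slots, and its tokens can round-robin across the replicas. This is a toy model; names and data structures are illustrative, not Wide-EP internals.

```python
# Toy expert->slot mapping: a hot expert occupies multiple slots and
# its traffic is spread across the replicas (illustrative only).
from itertools import cycle

def build_slot_map(replicas, num_slots):
    """replicas: {expert_id: replica_count}. Returns slot -> expert."""
    slots = []
    for expert, count in replicas.items():
        slots.extend([expert] * count)
    assert len(slots) == num_slots, "replica counts must fill all slots"
    return slots

def route(token_experts, slot_map):
    """Send each token to one slot serving its expert (round-robin)."""
    pools = {e: cycle([s for s, owner in enumerate(slot_map) if owner == e])
             for e in set(slot_map)}
    return [next(pools[e]) for e in token_experts]

# 3 experts in 4 slots: hot expert 0 gets two replicas.
slot_map = build_slot_map({0: 2, 1: 1, 2: 1}, num_slots=4)
tokens = [0, 0, 0, 0, 1, 2]
print(route(tokens, slot_map))  # expert-0 traffic alternates slots 0 and 1
```

Because slots, not experts, are pinned to GPUs, the engine can change `replicas` as traffic shifts without touching the model itself.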
Custom EP Communication Kernels
- Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Efficient all-to-all communication for expert dispatch and combine
- Reduced communication overhead vs traditional EP
Expert Parallelism Load Balancer (EPLB)
- Offline EPLB: Pre-computed expert placement based on historical workload statistics
- Online EPLB: Dynamic expert placement that adapts to real-time traffic patterns
- Layer-wise weight redistribution to minimize inference disruption
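The offline flavor can be illustrated with a minimal greedy placement: given historical per-expert token counts, assign the heaviest experts first, each to the currently least-loaded GPU. This is a sketch of the idea only; the real EPLB also handles replication and layer-wise weight redistribution.

```python
# Toy offline load balancing: heaviest-first greedy placement of
# experts onto the least-loaded GPU (illustrative sketch only).
import heapq

def place_experts(loads, num_gpus):
    """loads: {expert: historical token count}.
    Returns gpu -> [experts], balancing total historical load."""
    heap = [(0, g) for g in range(num_gpus)]  # (current load, gpu)
    placement = {g: [] for g in range(num_gpus)}
    for expert in sorted(loads, key=loads.get, reverse=True):
        load, gpu = heapq.heappop(heap)       # least-loaded GPU so far
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + loads[expert], gpu))
    return placement

# Skewed historical stats: expert 0 is far hotter than the rest.
loads = {0: 90, 1: 10, 2: 40, 3: 40, 4: 5, 5: 15}
print(place_experts(loads, num_gpus=2))  # both GPUs end up at load 100
```

An online balancer applies the same idea continuously, re-deriving `loads` from live traffic instead of historical statistics.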
Architecture Overview
Wide-EP separates experts (model perspective) from slots (engine perspective), allowing:
- The same expert to be replicated in multiple slots
- Dynamic remapping based on workload
- Load balancing without model retraining
Best Practices
Start with offline EPLB
For production deployments with known workload patterns, use offline EPLB to pre-compute optimal expert placement.
Use online EPLB for dynamic workloads
When traffic patterns change frequently or are unpredictable, enable online EPLB for real-time adaptation.
Monitor expert statistics
Track which experts receive the most tokens to understand workload distribution and validate load balancing effectiveness.
Tune max_num_tokens
Balance memory constraints and EP size by adjusting the maximum number of tokens per expert.
For detailed implementation examples and advanced usage, see:
- examples/wide_ep/: Complete Wide-EP examples
- examples/wide_ep/ep_load_balancer/: Load balancing tools
- examples/wide_ep/slurm_scripts/: Cluster deployment scripts
Practical Configuration Examples
Single Node Deployment
- 8xH100 - LLaMA 70B
- 8xH100 - Mixtral 8x7B
- 8xH100 - High Throughput
Multi-Node Deployment
- 16xH100 (2 Nodes) - LLaMA 405B
- 32xH100 (4 Nodes) - DeepSeek-V3
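Whatever the deployment, the parallelism degrees must multiply out to the cluster's GPU count. A small sanity-check helper (the mappings shown are plausible examples, not prescribed configurations):

```python
# Sanity-check a parallel mapping: the product of the parallelism
# degrees must equal the total number of GPUs in the cluster.
def check_mapping(num_nodes, gpus_per_node, tp=1, pp=1, dp=1):
    world = num_nodes * gpus_per_node
    assert tp * pp * dp == world, (
        f"tp*pp*dp = {tp * pp * dp}, but cluster has {world} GPUs")
    return world

# 2 nodes x 8 GPUs: e.g. TP=8 within each node, PP=2 across nodes,
# keeping the bandwidth-hungry TP collectives on intra-node NVLink.
check_mapping(2, 8, tp=8, pp=2)
# 4 nodes x 8 GPUs: e.g. TP=8, PP=4 (EP/attention-DP variants follow
# the same accounting).
check_mapping(4, 8, tp=8, pp=4)
```

Keeping TP inside a node and PP across nodes is the usual choice because TP communicates far more often than PP.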
Benchmarking Parallelism Strategies
Test different parallelism configurations.
For optimal performance, consult the reference configs database, which contains 170+ Pareto-optimized configurations across multiple models and GPUs.
Performance Tuning Guide
Choosing TP vs DP for Attention
Use TP when:
- Batch size is small (1-8)
- Model doesn’t fit in single GPU memory
- Low latency is critical
Use DP when:
- Batch size is large (16+)
- Model fits in single GPU memory
- High throughput is the goal
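These rules of thumb can be encoded as a tiny helper. This is purely a heuristic sketch of the guidance above, not a TensorRT-LLM API:

```python
# Heuristic chooser for the attention module, encoding the rules of
# thumb above (sketch only, not a TensorRT-LLM API).
def attention_strategy(batch_size, fits_on_one_gpu, latency_critical):
    """Suggest "TP" or "DP" for the attention module."""
    if not fits_on_one_gpu or latency_critical or batch_size <= 8:
        return "TP"
    return "DP"  # large batch, model fits per GPU, throughput goal

assert attention_strategy(4, True, True) == "TP"
assert attention_strategy(64, True, False) == "DP"
```

Real deployments should confirm the choice with benchmarks, since the crossover batch size depends on model and hardware.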
MoE Parallelism Selection
Use pure TP when:
- Expert count is low (8-16 experts)
- Load is balanced across experts
- Maximum kernel efficiency is needed
Use EP when:
- Expert count is high (64+ experts)
- Memory per GPU is limited
- Communication bandwidth is high
Use Hybrid ETP when:
- Expert count is moderate (16-64)
- You want to balance the benefits of TP and EP
- You need both workload balance and kernel efficiency
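The expert-count guidance above can likewise be written down as a small helper. A heuristic sketch only, with the thresholds taken directly from the bullets above:

```python
# Heuristic MoE execution-pattern chooser based on expert count
# (sketch of the guidance above, not a TensorRT-LLM API).
def moe_strategy(num_experts):
    """Suggest an MoE execution pattern from the expert count."""
    if num_experts < 16:
        return "TP"          # few experts: slice each expert's weights
    if num_experts <= 64:
        return "Hybrid ETP"  # moderate count: mix TP and EP
    return "EP"              # many experts: distribute whole experts

assert moe_strategy(8) == "TP"
assert moe_strategy(32) == "Hybrid ETP"
assert moe_strategy(256) == "EP"
```

Expert count is only the first-order signal; memory per GPU, interconnect bandwidth, and observed load balance should refine the choice.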
Wide-EP Considerations
- Only use for large-scale MoE models (DeepSeek-V3, LLaMA4, Qwen3)
- Monitor expert hit rates to identify hot experts
- Start with offline EPLB, migrate to online if workload varies
- Ensure high-bandwidth interconnect (NVLink, InfiniBand)
Additional Resources
Wide-EP Technical Blog
Deep dive into Wide Expert Parallelism
DeepSeek-V3 Paper
Research paper on large-scale MoE architecture
EPLB Implementation
Expert Parallelism Load Balancer reference
Reference Configs
170+ optimized serving configurations