## Overview

Optimizing LLM inference involves balancing multiple factors:

- Batch size - Number of concurrent requests processed together
- KV cache sizing - Memory allocation for attention caching
- Quantization - Precision reduction to improve throughput
- Parallelism strategy - Distributing computation across GPUs
- Hardware utilization - Maximizing GPU compute and memory bandwidth
## Reference Configurations

TensorRT-LLM provides 170+ Pareto-optimized serving configurations in the `examples/configs/database/` directory. These configs are pre-tuned for:
- Multiple models (Llama, DeepSeek, Mixtral, GPT, etc.)
- Different GPU types (H100, A100, B200, etc.)
- Various ISL/OSL combinations (input/output sequence lengths)
- Different concurrency levels (1 to 2048+ concurrent requests)
### Using Reference Configs

The `lookup.yaml` file maps configurations to specific scenarios:
### Example Reference Config

Here's a production-optimized config for DeepSeek-R1 on B200 GPUs:

- `max_batch_size: 512` - Maximum concurrent requests
- `kv_cache_config.dtype: fp8` - FP8 KV cache for 2x memory savings
- `kv_cache_config.free_gpu_memory_fraction: 0.8` - Use 80% of GPU memory
- `tensor_parallel_size: 8` - Distribute model across 8 GPUs
- `speculative_config` - Enable multi-token prediction for faster decoding
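Assembled from those settings, a minimal YAML sketch of such a config might look like the following. The nesting of the `kv_cache_config` keys follows the dotted names above; the body of `speculative_config` is an assumption (the text only names the key), so check a real config from `examples/configs/database/` for the exact fields:

```yaml
max_batch_size: 512
tensor_parallel_size: 8
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
speculative_config:
  # Assumed fields - the exact speculative-decoding settings are model-specific
  decoding_type: MTP
```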
## Batch Size Optimization

Batch size is the most critical parameter for throughput optimization.

### How Batch Size Affects Performance
#### Small Batches

Pros:

- Lower latency per request
- Better for interactive workloads

Cons:

- Poor GPU utilization
- Lower overall throughput
#### Large Batches

Pros:

- Higher GPU utilization
- Maximum throughput

Cons:

- Higher latency per request
- Requires more memory
### Finding Optimal Batch Size

#### Start with Reference Config
Use the reference config for your model, GPU, and workload pattern as a baseline.
#### Experiment with Batch Sizes
Test increasing batch sizes until throughput plateaus or memory is exhausted:
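That sweep can be sketched as a generic Python loop. The `measure_throughput` callback here is hypothetical - plug in `trtllm-bench` or your own harness:

```python
def find_best_batch_size(measure_throughput, candidates, plateau_tol=0.05):
    """Sweep batch sizes; stop once throughput gains fall below plateau_tol
    or memory is exhausted, and keep the last good setting."""
    best_bs, best_tput = None, 0.0
    for bs in candidates:
        try:
            tput = measure_throughput(bs)  # e.g. tokens/sec at this batch size
        except MemoryError:
            break  # out of GPU memory - stop the sweep
        if tput < best_tput * (1 + plateau_tol):
            break  # improvement below plateau_tol: throughput has plateaued
        best_bs, best_tput = bs, tput
    return best_bs, best_tput

# Example with a synthetic throughput curve that saturates at batch size 64:
best, tput = find_best_batch_size(lambda bs: min(bs, 64) * 100.0,
                                  [8, 16, 32, 64, 128, 256])
```

The 5% plateau tolerance is an illustrative choice; tighten it if you want to squeeze out small gains at the cost of a longer sweep.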
## KV Cache Optimization

The KV (Key-Value) cache stores attention states for generated tokens. Proper sizing is critical for performance.

### KV Cache Memory Usage

KV cache memory scales with:

- Batch size - More concurrent requests = more cache
- Sequence length - Longer sequences = larger cache per request
- Model size - More layers/heads = more cache per token
- Data type - FP16 (2 bytes) vs FP8 (1 byte) vs FP4 (0.5 bytes)
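The factors above multiply into a single estimate: 2 (separate K and V tensors) x layers x KV heads x head dimension x bytes per element x sequence length x batch size. A small sketch; the Llama-style example shape is illustrative, not from the original:

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads,
                   head_dim, dtype_bytes):
    """Total KV cache size in bytes; the factor 2 covers K and V."""
    return (2 * num_layers * num_kv_heads * head_dim * dtype_bytes
            * seq_len * batch_size)

# Illustrative Llama-8B-like shape: 32 layers, 8 KV heads, head_dim 128.
# FP16 (2 bytes/element), 4096-token context, batch of 1:
size = kv_cache_bytes(1, 4096, 32, 8, 128, 2)
print(size / 2**30, "GiB")  # 0.5 GiB; an FP8 cache (1 byte) halves this
```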
### Configuration Options

#### KV Cache Quantization

Reducing KV cache precision can dramatically increase capacity:

| Data Type | Memory per Token | Capacity Gain | Accuracy Impact |
|---|---|---|---|
| FP16 | 2 bytes | 1x (baseline) | None |
| FP8 | 1 byte | 2x | Minimal (less than 1% degradation) |
| FP4 | 0.5 bytes | 4x | Small (~2-3% degradation) |
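A sketch of enabling an FP8 KV cache in a TensorRT-LLM-style YAML config, using the `kv_cache_config` keys named earlier on this page:

```yaml
kv_cache_config:
  dtype: fp8                     # 1 byte per element: 2x capacity vs FP16
  free_gpu_memory_fraction: 0.8  # let the cache use 80% of remaining memory
```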
#### Automatic KV Cache Sizing

TensorRT-LLM can size the KV cache automatically, taking into account:

- Available GPU memory
- Model memory footprint
- Configured batch size and sequence length
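As a back-of-the-envelope model of that sizing (a deliberate simplification - the real allocator accounts for more overheads, such as activations and fragmentation):

```python
def kv_cache_budget_bytes(total_gpu_mem, model_footprint,
                          free_gpu_memory_fraction=0.8):
    """Approximate cache budget: a fraction of the memory left over
    after the model itself is loaded."""
    free = total_gpu_mem - model_footprint
    return int(free * free_gpu_memory_fraction)

# 80 GB GPU with a 20 GB model: ~48 GB left for KV cache at fraction 0.8
budget = kv_cache_budget_bytes(80 * 10**9, 20 * 10**9)
```

Dividing that budget by the per-request cache size (batch size x per-token bytes x sequence length) gives a rough upper bound on concurrency.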
## Quantization Selection

Quantization reduces model size and increases throughput by using lower precision.

### Supported Quantization Methods
#### FP8 - Best Balance

8-bit floating point quantization. Best for: production deployments on modern GPUs.
- Throughput: 1.5-2x vs FP16
- Memory: 2x reduction
- Accuracy: Less than 1% degradation
- Hardware: Requires Hopper GPUs (H100, H200) or newer
#### INT4/FP4 - Maximum Throughput

4-bit quantization.
- Throughput: 2-3x vs FP16
- Memory: 4x reduction
- Accuracy: 2-5% degradation (model-dependent)
- Hardware: Broad GPU support
#### INT8 - Broad Compatibility

8-bit integer quantization.
- Throughput: 1.3-1.8x vs FP16
- Memory: 2x reduction
- Accuracy: ~1-2% degradation
- Hardware: All modern NVIDIA GPUs
### Quantization Recommendations by Model Size
| Model Size | Recommended Quantization | Rationale |
|---|---|---|
| < 10B params | FP8 or FP16 | Fits in memory, maximize accuracy |
| 10-70B params | FP8 | Best throughput/accuracy balance |
| 70B+ params | FP8 + FP8 KV cache | Critical for memory efficiency |
| 400B+ params | FP4/INT4 | Required to fit the model in available GPU memory |
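To see why precision dominates at large scale, here is the raw weight footprint at each precision (decimal GB, weights only - activations and KV cache come on top):

```python
def weight_gb(num_params, bits_per_param):
    """Approximate weight memory in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# Footprint of 70B- and 405B-parameter models at common precisions:
for bits, name in [(16, "FP16"), (8, "FP8/INT8"), (4, "FP4/INT4")]:
    print(f"{name:8s} 70B -> {weight_gb(70e9, bits):6.1f} GB   "
          f"405B -> {weight_gb(405e9, bits):6.1f} GB")
```

At FP16 a 405B model needs roughly 810 GB for weights alone, which is why 4-bit quantization (about 200 GB) is what brings such models within reach of a single multi-GPU node.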
## Parallelism Strategies

Distribute model computation across multiple GPUs.

### Tensor Parallelism (TP)

Splits individual layers across GPUs:

- High communication overhead (all-reduce on each layer)
- Best for models that don’t fit on single GPU
- Scales well up to 8 GPUs, diminishing returns beyond that
### Pipeline Parallelism (PP)

Splits model layers across GPUs:

- Lower communication overhead
- Requires larger batch sizes for efficiency (fill pipeline)
- Can have GPU idle time (bubble overhead)
### Expert Parallelism (EP)

For Mixture-of-Experts models, distribute experts across GPUs.

### Combining Strategies

Always ensure `tensor_parallel_size * pipeline_parallel_size = total_gpus`.

## Hardware Utilization
### CUDA Graphs

CUDA graphs reduce kernel launch overhead by recording execution patterns:

- 10-20% latency reduction for small batches
- Most effective for batch sizes < 32
- Requires padding (slight memory overhead)
- Only works with fixed-size batches
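A sketch of what enabling this could look like in a TensorRT-LLM-style YAML config. The `cuda_graph_config` key and field names below are assumptions patterned on the behavior described above, so verify them against the current LLM API reference:

```yaml
cuda_graph_config:
  enable_padding: true               # pad requests up to a captured size
  batch_sizes: [1, 2, 4, 8, 16, 32]  # fixed batch sizes to capture graphs for
```

Listing only small batch sizes matches the guidance above: graphs help most below batch 32, and each captured size costs some memory.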
### Memory Management

### Attention Kernels

TensorRT-LLM automatically selects optimal attention implementations:

- Flash Attention - Memory-efficient for long sequences
- Paged Attention - Efficient KV cache management
- XQA - Optimized for Hopper GPUs
## Speculative Decoding

Accelerate generation by predicting multiple tokens per iteration:

- 1.5-2x speedup for compatible models
- No accuracy loss (predictions are verified)
- Model must have MTP or Medusa heads
- Works best for predictable text generation
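A hedged sketch of turning this on via the `speculative_config` key mentioned in the example config earlier. The field names below are assumptions patterned on TensorRT-LLM's multi-token-prediction support, so verify against the current docs:

```yaml
speculative_config:
  decoding_type: MTP           # multi-token prediction (model must have MTP heads)
  num_nextn_predict_layers: 3  # assumed field: draft tokens proposed per step
```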
## Performance Tuning Workflow

### Establish Baseline

Start with the reference config for your model/GPU/workload. Record throughput and latency metrics.
### Tune Batch Size

Increase batch size until throughput plateaus. Stop when memory is exhausted or throughput stops improving.
### Profile and Analyze

## Common Pitfalls

## Performance Checklist
- Used reference config as baseline
- Enabled FP8 quantization (if using Hopper GPUs)
- Configured FP8 KV cache
- Tuned batch size for workload
- Set appropriate memory fraction (0.8-0.9)
- Enabled CUDA graphs for low-latency scenarios
- Validated accuracy on representative samples
- Profiled to identify bottlenecks
## Related Resources

- Benchmarking - Measure performance with `trtllm-bench`
- Profiling - Analyze performance with profiling tools
- Reference Configs - Browse 170+ optimized configurations
- Quantization - Deep dive into quantization methods