This guide covers performance tuning strategies to maximize throughput and minimize latency when deploying TensorRT-LLM models.

Overview

Optimizing LLM inference involves balancing multiple factors:
  • Batch size - Number of concurrent requests processed together
  • KV cache sizing - Memory allocation for attention caching
  • Quantization - Precision reduction to improve throughput
  • Parallelism strategy - Distributing computation across GPUs
  • Hardware utilization - Maximizing GPU compute and memory bandwidth

Reference Configurations

TensorRT-LLM provides 170+ Pareto-optimized serving configurations in the examples/configs/database/ directory. These configs are pre-tuned for:
  • Multiple models (Llama, DeepSeek, Mixtral, GPT, etc.)
  • Different GPU types (H100, A100, B200, etc.)
  • Various ISL/OSL combinations (input/output sequence lengths)
  • Different concurrency levels (1 to 2048+ concurrent requests)
Always start with reference configs as your baseline. These configurations have been profiled and optimized for specific workload patterns.

Using Reference Configs

The lookup.yaml file maps configurations to specific scenarios:
- model: deepseek-ai/DeepSeek-R1-0528
  arch: DeepseekV3ForCausalLM
  gpu: B200_NVL
  isl: 1024
  osl: 1024
  concurrency: 16
  config_path: examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc16.yaml
  num_gpus: 8
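As a sketch of how such a lookup might be automated, the snippet below matches a workload against in-memory entries shaped like the lookup.yaml records above. The select_config helper, its distance heuristic, and the second (truncated) entry are illustrative assumptions, not part of TensorRT-LLM.

```python
# Hypothetical lookup over entries shaped like lookup.yaml records.
ENTRIES = [
    {"model": "deepseek-ai/DeepSeek-R1-0528", "gpu": "B200_NVL",
     "isl": 1024, "osl": 1024, "concurrency": 16,
     "config_path": "examples/configs/database/deepseek-ai/DeepSeek-R1-0528/B200/1k1k_tp8_conc16.yaml"},
    {"model": "deepseek-ai/DeepSeek-R1-0528", "gpu": "B200_NVL",
     "isl": 1024, "osl": 1024, "concurrency": 256,
     "config_path": ".../1k1k_tp8_conc256.yaml"},  # illustrative entry
]

def select_config(entries, model, gpu, isl, osl, concurrency):
    """Return the matching-model/GPU entry whose workload shape is closest."""
    candidates = [e for e in entries if e["model"] == model and e["gpu"] == gpu]
    if not candidates:
        raise LookupError(f"no reference config for {model} on {gpu}")
    # Prefer matching ISL/OSL first, then the nearest concurrency tier.
    return min(candidates,
               key=lambda e: (abs(e["isl"] - isl) + abs(e["osl"] - osl),
                              abs(e["concurrency"] - concurrency)))

best = select_config(ENTRIES, "deepseek-ai/DeepSeek-R1-0528",
                     "B200_NVL", 1024, 1024, 32)
print(best["config_path"])  # the concurrency-16 tier is nearest to 32
```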

Example Reference Config

Here’s a production-optimized config for DeepSeek-R1 on B200 GPUs:
max_batch_size: 512
cuda_graph_config:
  enable_padding: true
  max_batch_size: 16
print_iter_log: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: TRTLLM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
tensor_parallel_size: 8
moe_expert_parallel_size: 8
trust_remote_code: true
max_num_tokens: 3136
max_seq_len: 2068
Key parameters:
  • max_batch_size: 512 - Maximum concurrent requests
  • kv_cache_config.dtype: fp8 - FP8 KV cache for 2x memory savings
  • kv_cache_config.free_gpu_memory_fraction: 0.8 - Use 80% of GPU memory
  • tensor_parallel_size: 8 - Distribute model across 8 GPUs
  • speculative_config - Enable multi-token prediction for faster decoding

Batch Size Optimization

Batch size is the most critical parameter for throughput optimization.

How Batch Size Affects Performance

Small Batches

Pros:
  • Lower latency per request
  • Better for interactive workloads
Cons:
  • Poor GPU utilization
  • Lower overall throughput

Large Batches

Pros:
  • Higher GPU utilization
  • Maximum throughput
Cons:
  • Higher latency per request
  • Requires more memory

Finding Optimal Batch Size

Step 1: Start with Reference Config

Use the reference config for your model, GPU, and workload pattern as a baseline.
Step 2: Measure Current Performance

Benchmark with your actual workload:
trtllm-bench --model meta-llama/Llama-3.1-8B \
  throughput \
  --dataset your_dataset.jsonl \
  --backend pytorch \
  --config baseline.yaml
Step 3: Experiment with Batch Sizes

Test increasing batch sizes until throughput plateaus or memory is exhausted:
# Test configurations
max_batch_size: 64   # baseline
max_batch_size: 128  # 2x increase
max_batch_size: 256  # 4x increase
max_batch_size: 512  # 8x increase
Step 4: Monitor Memory Usage

Check GPU memory during benchmarks:
nvidia-smi dmon -s mu
Setting max_batch_size too high can cause OOM errors. Always monitor GPU memory usage and leave headroom for KV cache growth.

KV Cache Optimization

The KV (Key-Value) cache stores attention states for generated tokens. Proper sizing is critical for performance.

KV Cache Memory Usage

KV cache memory scales with:
  • Batch size - More concurrent requests = more cache
  • Sequence length - Longer sequences = larger cache per request
  • Model size - More layers/heads = more cache per token
  • Data type - FP16 (2 bytes) vs FP8 (1 byte) vs FP4 (0.5 bytes)
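To make the scaling concrete, here is a back-of-envelope calculation of KV cache cost per token. The model dimensions (32 layers, 8 GQA KV heads, head dimension 128, roughly Llama-3.1-8B-shaped) are illustrative assumptions; check your model's config for real numbers.

```python
# Rough KV cache cost per token: keys + values across all layers.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor of 2: both K and V are stored for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative Llama-3.1-8B-like dimensions (verify against config.json):
fp16 = kv_bytes_per_token(32, 8, 128, 2)  # 131072 bytes = 128 KiB/token
fp8  = kv_bytes_per_token(32, 8, 128, 1)  # 65536 bytes  =  64 KiB/token

# 512 concurrent requests, each with a 2048-token sequence:
total_gib = 512 * 2048 * fp16 / 2**30
print(f"{total_gib:.0f} GiB of FP16 KV cache")  # 128 GiB
```

Numbers like these are why batch size, sequence length, and cache dtype must be tuned together: halving the dtype width doubles how many sequences fit in the same memory budget.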

Configuration Options

kv_cache_config:
  # Data type for KV cache
  dtype: fp8  # Options: auto, fp16, fp8
  
  # GPU memory fraction reserved for KV cache
  free_gpu_memory_fraction: 0.8  # Use 80% of available memory
  
  # Enable KV cache reuse across requests
  enable_block_reuse: true

KV Cache Quantization

Reducing KV cache precision can dramatically increase capacity:
Data Type | Memory per Token | Capacity Gain | Accuracy Impact
FP16      | 2 bytes          | 1x (baseline) | None
FP8       | 1 byte           | 2x            | Minimal (less than 1% degradation)
FP4       | 0.5 bytes        | 4x            | Small (~2-3% degradation)
Recommendation: Start with FP8 KV cache - it provides 2x memory savings with negligible accuracy loss.

Automatic KV Cache Sizing

kv_cache_config:
  free_gpu_memory_fraction: 0.9  # Auto-size to use 90% of free memory
TensorRT-LLM automatically calculates optimal KV cache blocks based on:
  • Available GPU memory
  • Model memory footprint
  • Configured batch size and sequence length
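A simplified sketch of that calculation follows; the 32-token block size and byte figures are assumptions for illustration, and the real allocator accounts for additional buffers beyond this.

```python
# Sketch: how KV cache block counts fall out of free GPU memory.
def num_kv_blocks(free_bytes, fraction, bytes_per_token, block_tokens=32):
    budget = int(free_bytes * fraction)          # memory reserved for cache
    return budget // (bytes_per_token * block_tokens)

free = 60 * 2**30                                # 60 GiB free after weights
blocks = num_kv_blocks(free, 0.9, 65536)         # FP8 cache, 64 KiB/token
print(blocks, "blocks =", blocks * 32, "cacheable tokens")
```

Raising free_gpu_memory_fraction directly raises the cacheable-token count, which is why it is one of the first knobs to tune when requests queue up waiting for cache space.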

Quantization Selection

Quantization reduces model size and increases throughput by using lower precision.

Supported Quantization Methods

FP8 (8-bit floating point quantization)
  • Throughput: 1.5-2x vs FP16
  • Memory: 2x reduction
  • Accuracy: Less than 1% degradation
  • Hardware: Requires Hopper GPUs (H100, H200) or newer
# Use pre-quantized checkpoint
trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  throughput --dataset dataset.jsonl --backend pytorch
Best for: Production deployments on modern GPUs
FP4/INT4 (4-bit quantization)
  • Throughput: 2-3x vs FP16
  • Memory: 4x reduction
  • Accuracy: 2-5% degradation (model-dependent)
  • Hardware: Broad GPU support
Best for: Maximum throughput when accuracy trade-offs are acceptable
INT8 (8-bit integer quantization)
  • Throughput: 1.3-1.8x vs FP16
  • Memory: 2x reduction
  • Accuracy: ~1-2% degradation
  • Hardware: All modern NVIDIA GPUs
Best for: Deployments on older GPU architectures (Ampere/Turing)

Quantization Recommendations by Model Size

Model Size    | Recommended Quantization | Rationale
< 10B params  | FP8 or FP16              | Fits in memory; maximize accuracy
10-70B params | FP8                      | Best throughput/accuracy balance
70B+ params   | FP8 + FP8 KV cache       | Critical for memory efficiency
400B+ params  | FP4/INT4                 | Needed to fit within a single node's memory
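A quick way to see why the recommendations fall where they do is to compute raw weight memory at each precision (parameter bytes only; runtime overhead and KV cache come on top):

```python
# Back-of-envelope weight memory by precision.
def weight_gib(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 2**30

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4/FP4", 0.5)]:
    print(f"70B @ {name}: {weight_gib(70, bpp):.0f} GiB")
# FP16 ~130 GiB (needs 2+ GPUs); FP8 ~65 GiB (fits one 80 GB H100)
```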

Parallelism Strategies

Distribute model computation across multiple GPUs.

Tensor Parallelism (TP)

Splits individual layers across GPUs:
tensor_parallel_size: 8  # Split model across 8 GPUs
Characteristics:
  • High communication overhead (all-reduce on each layer)
  • Best for models that don’t fit on single GPU
  • Scales well up to 8 GPUs, diminishing returns beyond that
Use when: Model memory exceeds single GPU capacity

Pipeline Parallelism (PP)

Splits model layers across GPUs:
pipeline_parallel_size: 4  # Distribute layers across 4 GPUs
Characteristics:
  • Lower communication overhead
  • Requires larger batch sizes for efficiency (fill pipeline)
  • Can have GPU idle time (bubble overhead)
Use when: You have many layers and large batch sizes

Expert Parallelism (EP)

For Mixture-of-Experts models, distribute experts across GPUs:
moe_expert_parallel_size: 8  # Distribute experts across 8 GPUs
tensor_parallel_size: 1      # EP can also be combined with TP > 1
Use when: Deploying MoE models (Mixtral, DeepSeek-V3, etc.)

Combining Strategies

# Example: 32 GPU deployment of 405B model
tensor_parallel_size: 8
pipeline_parallel_size: 4
# Total: 8 * 4 = 32 GPUs
Always ensure tensor_parallel_size * pipeline_parallel_size = total_gpus
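A small pre-flight check for such a layout can catch mismatches before launch; validate_mapping below is a hypothetical helper, not a TensorRT-LLM API, and the 126-layer figure is an illustrative 405B-class layer count.

```python
# Sanity-check a TP x PP layout against the available GPU count.
def validate_mapping(total_gpus, tp=1, pp=1, num_layers=None):
    if tp * pp != total_gpus:
        raise ValueError(f"tensor_parallel_size * pipeline_parallel_size "
                         f"= {tp * pp}, but {total_gpus} GPUs are available")
    if num_layers is not None and num_layers % pp != 0:
        # Uneven stages make the slowest pipeline stage the bottleneck.
        print(f"warning: {num_layers} layers split unevenly across {pp} stages")

validate_mapping(32, tp=8, pp=4, num_layers=126)  # 32-GPU, 405B-style layout
```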

Hardware Utilization

CUDA Graphs

CUDA graphs reduce kernel launch overhead by recording execution patterns:
cuda_graph_config:
  enable_padding: true       # Pad batches to fixed sizes
  max_batch_size: 16         # Maximum batch size for graphs
Benefits:
  • 10-20% latency reduction for small batches
  • Most effective for batch sizes < 32
Trade-offs:
  • Requires padding (slight memory overhead)
  • Only works with fixed-size batches
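Conceptually, padding rounds each incoming batch up to the nearest captured graph size. TensorRT-LLM handles this internally when enable_padding is set; the toy helper below only illustrates the idea, with assumed bucket sizes.

```python
import bisect

BUCKETS = [1, 2, 4, 8, 16]  # illustrative batch sizes with captured graphs

def pad_to_bucket(batch_size, buckets=BUCKETS):
    """Round up to the smallest captured batch size; None means eager fallback."""
    i = bisect.bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None

print(pad_to_bucket(5))   # 8: three padded (dummy) sequences added
print(pad_to_bucket(16))  # 16: exact match, no padding
print(pad_to_bucket(40))  # None: batch runs without a CUDA graph
```

The padding waste (here 3 of 8 slots at batch size 5) is the memory overhead the trade-off list above refers to.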

Memory Management

# Maximize GPU memory utilization
kv_cache_config:
  free_gpu_memory_fraction: 0.9  # Use 90% of memory
  
# Enable memory-efficient features
enable_chunked_context: true      # Process long contexts in chunks
max_num_tokens: 8192              # Maximum tokens per iteration
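Conceptually, chunked context splits a long prompt into prefill slices no larger than max_num_tokens, so one huge prompt cannot monopolize an iteration. The toy generator below illustrates the slicing only, not the engine's actual scheduler.

```python
# Slice a long prompt into prefill chunks of at most max_num_tokens.
def context_chunks(prompt_len, max_num_tokens):
    start = 0
    while start < prompt_len:
        end = min(start + max_num_tokens, prompt_len)
        yield (start, end)
        start = end

print(list(context_chunks(20000, 8192)))
# [(0, 8192), (8192, 16384), (16384, 20000)]
```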

Attention Kernels

TensorRT-LLM automatically selects optimal attention implementations:
  • Flash Attention - Memory-efficient for long sequences
  • Paged Attention - Efficient KV cache management
  • XQA - Optimized for Hopper GPUs
These are auto-selected based on your hardware and model config. No manual tuning needed.

Speculative Decoding

Accelerate generation by predicting multiple tokens per iteration:
speculative_config:
  decoding_type: MTP                    # Multi-Token Prediction
  num_nextn_predict_layers: 3           # Number of prediction layers
Benefits:
  • 1.5-2x speedup for compatible models
  • No accuracy loss (predictions are verified)
Requirements:
  • Model must have MTP/medusa heads
  • Works best for predictable text generation
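The verification step can be pictured with a toy accept loop: the draft's proposed tokens are compared against what the target model would have produced at each position, and only the longest agreeing prefix plus one corrected token is kept, which is why accuracy is preserved. The token IDs below are arbitrary illustrations; real MTP heads live inside the model.

```python
# Toy speculative-decoding verification: accept the agreeing prefix,
# then take the target model's token at the first mismatch.
def verify(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction replaces the mismatch
            break
    return accepted

print(verify([5, 9, 2], [5, 9, 7]))  # [5, 9, 7]: two accepted + one corrected
```

When the draft is usually right (predictable text), most iterations emit several tokens for one target-model pass, which is where the 1.5-2x speedup comes from.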

Performance Tuning Workflow

Step 1: Establish Baseline

Start with reference config for your model/GPU/workload:
trtllm-bench --model meta-llama/Llama-3.1-70B \
  throughput \
  --dataset dataset.jsonl \
  --config examples/configs/database/.../baseline.yaml
Record throughput and latency metrics.
Step 2: Enable FP8 Quantization

Use pre-quantized checkpoint:
trtllm-bench --model nvidia/Llama-3.1-70B-Instruct-FP8 \
  throughput --dataset dataset.jsonl
Expect 1.5-2x throughput improvement.
Step 3: Optimize KV Cache

Enable FP8 KV cache:
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.85
Monitor for memory savings and capacity increase.
Step 4: Tune Batch Size

Increase batch size until throughput plateaus:
max_batch_size: 128  # Start here
max_batch_size: 256  # Double it
max_batch_size: 512  # Keep going
Stop when memory is exhausted or throughput stops improving.
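This step can be framed as a doubling search that stops at the plateau. In the sketch below, measure() is a stand-in for an actual trtllm-bench run, the mock throughput curve is invented, and the 5% gain threshold is an arbitrary choice.

```python
# Double max_batch_size until the throughput gain falls below a threshold.
def tune_batch_size(measure, start=128, limit=2048, min_gain=1.05):
    best_bs, best_tput = start, measure(start)
    bs = start * 2
    while bs <= limit:
        tput = measure(bs)
        if tput < best_tput * min_gain:  # under 5% gain: plateau reached
            break
        best_bs, best_tput = bs, tput
        bs *= 2
    return best_bs, best_tput

# Mock tokens/sec curve that saturates around batch size 512:
mock = {128: 10_000, 256: 17_000, 512: 21_000, 1024: 21_400}
bs, tput = tune_batch_size(lambda b: mock[b])
print(bs, tput)  # 512 21000
```

In practice each measure() call is a full benchmark run, so the doubling schedule keeps the number of runs small.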
Step 5: Profile and Analyze

Use profiling tools to identify bottlenecks:
TLLM_PROFILE_START_STOP=10-50 nsys profile \
  -o trace -t cuda,nvtx -c cudaProfilerApi \
  trtllm-bench ...
See Profiling Guide for details.

Common Pitfalls

OOM Errors: Reduce max_batch_size or free_gpu_memory_fraction if you hit out-of-memory errors.
Low Throughput: If throughput is low despite high batch size, check GPU utilization with nvidia-smi. Low utilization may indicate CPU bottlenecks or inefficient batching.
High Latency: If individual request latency is too high, reduce batch size or enable streaming to get time-to-first-token improvements.

Performance Checklist

  • Used reference config as baseline
  • Enabled FP8 quantization (if using Hopper GPUs)
  • Configured FP8 KV cache
  • Tuned batch size for workload
  • Set appropriate memory fraction (0.8-0.9)
  • Enabled CUDA graphs for low-latency scenarios
  • Validated accuracy on representative samples
  • Profiled to identify bottlenecks

Next Steps

  • Benchmarking - Measure performance with trtllm-bench
  • Profiling - Analyze performance with profiling tools
  • Reference Configs - Browse 170+ optimized configurations
  • Quantization - Deep dive into quantization methods
