Quickstart Guide

This guide will get you up and running with TensorRT-LLM in minutes. You’ll learn how to deploy a model for online serving and run offline inference using the Python API.

Prerequisites

Before you begin, ensure you have:
  • NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper, or Blackwell)
  • NVIDIA Driver version 535+ installed
  • Docker with NVIDIA Container Toolkit installed
  • 8GB+ GPU memory (for TinyLlama example; larger models require more)
If you don’t have Docker installed, follow the NVIDIA Container Toolkit installation guide.

Launch Docker Container

1

Pull and run the TensorRT-LLM container

The TensorRT-LLM container comes with all dependencies pre-installed. Start the container with GPU access:
docker run --rm -it \
  --ipc host \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
Replace 1.3.0rc6 with the latest version tag from the NGC catalog.
The -p 8000:8000 flag exposes port 8000 for the serving API.
2

Verify the installation

Once inside the container, verify TensorRT-LLM is installed:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
You should see the version number printed (e.g., 1.3.0rc6).

Option 1: Online Serving with trtllm-serve

The fastest way to deploy a model is using trtllm-serve, which provides an OpenAI-compatible API.
1

Start the server

Launch a model server using trtllm-serve. This example uses TinyLlama for quick testing:
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
The server will:
  • Download the model from Hugging Face (on first run)
  • Optimize the model for your GPU
  • Start an OpenAI-compatible HTTP server on port 8000
The first run may take a few minutes to download and optimize the model. Subsequent runs are much faster as the optimized model is cached.
2

Send a test request

Open a new terminal and attach to the running container:
docker exec -it <container_id> bash
Alternatively, since port 8000 is published to the host, you can send a curl request from outside the container:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
Example Response:
{
  "id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
  "object": "chat.completion",
  "created": 1741966075,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris. It is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "total_tokens": 58,
    "completion_tokens": 30
  }
}
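If you are scripting against the endpoint, a small helper for pulling the reply text and token counts out of a response in this shape can be handy. This is a sketch assuming only the JSON structure shown above:

```python
def parse_chat_completion(resp: dict) -> tuple[str, int]:
    """Extract the assistant's reply and total token count
    from an OpenAI-style chat completion response."""
    content = resp["choices"][0]["message"]["content"]
    total_tokens = resp["usage"]["total_tokens"]
    return content, total_tokens

# Example with a response in the shape shown above (truncated):
resp = {
    "choices": [{"message": {"role": "assistant",
                             "content": "The capital of France is Paris."}}],
    "usage": {"prompt_tokens": 28, "total_tokens": 58, "completion_tokens": 30},
}
text, tokens = parse_chat_completion(resp)
print(text)    # The capital of France is Paris.
print(tokens)  # 58
```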
3

Try a quantized model (optional)

For better performance, use a pre-quantized FP8 model:
trtllm-serve "nvidia/Qwen3-8B-FP8"
FP8 quantization requires Hopper GPUs (H100, H200) or newer. For other GPUs, use standard models or INT4/INT8 quantization.
Browse more pre-optimized models in the NVIDIA Model Optimizer collection.

Streaming Responses

To enable streaming (useful for chat applications), add "stream": true to your request:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "max_tokens": 200,
    "stream": true
  }'
You’ll receive Server-Sent Events (SSE) with tokens as they’re generated.
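Each SSE event arrives as a `data: {...}` line, followed by a final `data: [DONE]` sentinel. A minimal parser sketch, assuming chunks follow the OpenAI streaming format (where each chunk carries a `delta` with incremental content):

```python
import json

def collect_stream(lines):
    """Accumulate content from OpenAI-style SSE streaming chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Example with hand-written chunks:
events = [
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    'data: [DONE]',
]
print(collect_stream(events))  # Once upon
```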

Option 2: Offline Inference with LLM API

For batch processing or integration into Python applications, use the LLM API directly.
1

Create a Python script

Create a file called quickstart.py with the following code:
quickstart.py
from tensorrt_llm import LLM, SamplingParams

# Initialize the LLM with a HuggingFace model
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sample prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate responses
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
    print("-" * 80)
This code:
  • Loads TinyLlama from Hugging Face
  • Defines three sample prompts
  • Generates completions with temperature sampling
  • Prints the results
2

Run the script

Execute the script inside the Docker container:
python3 quickstart.py
Example Output:
Prompt: 'Hello, my name is'
Generated: ' John and I am a software engineer. I have been working on...'
--------------------------------------------------------------------------------
Prompt: 'The capital of France is'
Generated: 'Paris.'
--------------------------------------------------------------------------------
Prompt: 'The future of AI is'
Generated: 'an exciting time for us. We are constantly researching and developing...'
--------------------------------------------------------------------------------
3

Try async generation (optional)

For concurrent request processing, use the async API:
async_example.py
import asyncio
from tensorrt_llm import LLM, SamplingParams

async def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    
    prompts = [
        "What is machine learning?",
        "Explain quantum computing",
        "What is the theory of relativity?",
    ]
    
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
    
    # Generate all prompts concurrently
    async def generate_one(prompt):
        output = await llm.generate_async(prompt, sampling_params)
        return output.outputs[0].text
    
    results = await asyncio.gather(*[generate_one(p) for p in prompts])
    
    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}")
        print(f"A: {result}\n")

if __name__ == "__main__":
    asyncio.run(main())

Loading Quantized Models

Load pre-quantized models directly from Hugging Face for optimal performance:
from tensorrt_llm import LLM, SamplingParams

# Load an FP8-quantized Qwen model
llm = LLM(model="nvidia/Qwen3-8B-FP8")

outputs = llm.generate(
    ["Explain quantum entanglement"],
    SamplingParams(max_tokens=150),
)
print(outputs[0].outputs[0].text)
Quantized models require significantly less GPU memory and can achieve 2-4x higher throughput with minimal accuracy loss.
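The weight-memory saving is easy to estimate: FP8 stores one byte per parameter versus two for FP16. A back-of-the-envelope sketch, taking the nominal 8B parameter count at face value and ignoring activations and KV cache:

```python
# Weight memory for an 8B-parameter model at different precisions
params = 8e9
fp16_gb = params * 2 / 1e9  # 2 bytes per parameter
fp8_gb = params * 1 / 1e9   # 1 byte per parameter

print(f"FP16 weights: {fp16_gb:.0f} GB")  # FP16 weights: 16 GB
print(f"FP8 weights:  {fp8_gb:.0f} GB")   # FP8 weights:  8 GB
```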

Customizing Generation

The SamplingParams class provides fine-grained control over text generation:
from tensorrt_llm import SamplingParams

# Conservative sampling for factual responses
factual_params = SamplingParams(
    temperature=0.1,
    top_p=0.9,
    max_tokens=100
)

# Creative sampling for storytelling
creative_params = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    top_k=50,
    max_tokens=500
)

# Greedy decoding (deterministic)
greedy_params = SamplingParams(
    temperature=0.0,
    max_tokens=100
)

# With repetition penalty
no_repeat_params = SamplingParams(
    temperature=0.8,
    repetition_penalty=1.2,
    max_tokens=200
)
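Temperature works by rescaling the logits before the softmax: low temperatures sharpen the distribution toward the most likely token, while high temperatures flatten it. A small standalone illustration (plain Python, independent of TensorRT-LLM):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.1))  # nearly one-hot on the top logit
print(softmax_with_temperature(logits, 1.0))  # probability spread across tokens
```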

Next Steps

Congratulations! You’ve successfully run your first TensorRT-LLM inference. Here’s what to explore next:

Installation Options

Learn about pip installation, building from source, and system requirements

Advanced Examples

Explore speculative decoding, multi-GPU inference, LoRA adapters, and more

Model Support

Check which models are supported and how to add custom models

Performance Tuning

Learn how to benchmark and optimize performance for your workload

Common Issues

If you encounter CUDA out of memory errors:
  1. Use a smaller model (e.g., TinyLlama instead of Llama-70B)
  2. Reduce batch size with max_batch_size parameter
  3. Use quantized models (FP8, INT4) to reduce memory footprint
  4. Enable KV cache offloading for long sequences
from tensorrt_llm import LLM, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5)
)
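The free_gpu_memory_fraction setting caps how much of the remaining GPU memory the KV cache may claim. How far that budget goes depends on the per-token cache size, which for a grouped-query-attention model is 2 (K and V) × layers × kv_heads × head_dim × bytes per element. A rough sketch; the layer and head counts below are illustrative assumptions chosen to resemble TinyLlama-1.1B, not values read from the checkpoint:

```python
# Illustrative KV-cache budget estimate (model shape values are assumptions)
layers, kv_heads, head_dim = 22, 4, 64
bytes_per_elem = 2  # FP16 cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
budget_bytes = 0.5 * 8e9  # half of an 8 GB card, as with free_gpu_memory_fraction=0.5

print(f"KV cache per token: {per_token} bytes")
print(f"Approx. tokens that fit: {int(budget_bytes // per_token):,}")
```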
If model download from Hugging Face fails:
  1. Check your internet connection
  2. Set your Hugging Face token if the model requires authentication:
    export HF_TOKEN="your_token_here"
    
  3. Download the model manually and pass the local path:
    llm = LLM(model="/path/to/local/model")
    
If you can’t connect to the trtllm-serve server:
  1. Ensure port 8000 is exposed with -p 8000:8000 in docker run
  2. Check the server started successfully (no errors in logs)
  3. Try connecting from inside the container first:
    curl http://localhost:8000/health
    
  4. Check firewall settings if connecting from another machine
For more troubleshooting help, check the FAQ or open an issue on GitHub.