SGLang supports various quantization methods to reduce memory usage and increase throughput. Quantization converts model weights from high-precision formats (BF16/FP16) to lower-precision formats (INT8/FP8/INT4/FP4).
Offline quantization is recommended over online quantization: it starts faster, uses less memory at load time, and lets you validate quality before deployment.
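As a rough illustration of what that conversion does (a toy sketch, not SGLang internals), symmetric INT8 quantization stores one byte per weight plus a scale factor:

```python
import numpy as np

# Toy illustration of symmetric INT8 weight quantization (not SGLang's code).
w = np.array([0.31, -1.20, 0.07, 2.50, -0.66], dtype=np.float32)

scale = np.abs(w).max() / 127.0                 # one scale for the whole tensor
w_int8 = np.round(w / scale).astype(np.int8)    # stored weights: 1 byte each
w_dequant = w_int8.astype(np.float32) * scale   # recovered at compute time

print(w_int8)                        # quantized integer weights
print(np.abs(w - w_dequant).max())   # worst-case error is bounded by ~scale/2
```

Real quantizers refine this idea with per-channel or per-group scales, calibration data, and lower bit widths, but the storage/scale split is the same.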
Quantization Types
- **Offline Quantization**: load pre-quantized model weights. Required for GPTQ and AWQ; optimal for FP8/FP4.
- **Online Quantization**: quantize weights dynamically at runtime. Convenient, but startup is slower and memory usage is higher.
Offline vs Online
| Aspect | Offline | Online |
|---|---|---|
| Startup time | Fast | Slow (quantization on startup) |
| Memory usage | Low | High (during quantization) |
| Quality control | Can be validated before deployment | Limited pre-deployment validation |
| Preparation | Requires quantization step | No preparation needed |
Offline Quantization
Load pre-quantized models directly. The quantization method is automatically detected from the model configuration.
Basic Usage
```shell
# Load pre-quantized model (quantization auto-detected)
python -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --port 30000
```
Do NOT add `--quantization` when loading pre-quantized models. The quantization method is parsed from the model config.
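For intuition, auto-detection works off the `quantization_config` block that pre-quantized checkpoints carry in `config.json`. The sample config and helper below are illustrative, not SGLang's actual loader code:

```python
# Illustrative sketch of quantization auto-detection (not SGLang's loader code).
# This dict mimics the `quantization_config` entry a pre-quantized AWQ
# checkpoint typically carries in its config.json.
config = {
    "model_type": "llama",
    "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128},
}

def detect_quant_method(config):
    """Return the declared quantization method, or None for full precision."""
    return config.get("quantization_config", {}).get("quant_method")

print(detect_quant_method(config))                  # awq
print(detect_quant_method({"model_type": "llama"})) # None (unquantized model)
```

Because the checkpoint declares its own method, passing `--quantization` on top of it is redundant at best and conflicting at worst.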
Per-Channel Quantization
For per-channel quantized models (INT8/FP8) with per-token dynamic quantization, you can optionally specify `--quantization` to use sgl-kernel instead of vLLM kernels:
```shell
python -m sglang.launch_server \
    --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic \
    --quantization w8a8_fp8  # Use sgl-kernel FP8 kernel
```
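For intuition, "w8a8" schemes combine static per-channel weight scales with per-token activation scales computed at runtime. A numpy sketch of the scale computation, shown with INT8 (illustrative only, not the sgl-kernel implementation):

```python
import numpy as np

# Illustrative W8A8 scale computation (not the sgl-kernel implementation).
rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 8)).astype(np.float32)  # [out_channels, in_features]
acts = rng.standard_normal((3, 8)).astype(np.float32)    # [tokens, in_features]

# Per-channel (static) weight scales: one scale per output channel.
w_scale = np.abs(weight).max(axis=1, keepdims=True) / 127.0
w_q = np.round(weight / w_scale).astype(np.int8)

# Per-token (dynamic) activation scales: one scale per token, found at runtime.
a_scale = np.abs(acts).max(axis=1, keepdims=True) / 127.0
a_q = np.round(acts / a_scale).astype(np.int8)

# The INT8 matmul accumulates in int32; the scales are folded back in after.
out = (a_q.astype(np.int32) @ w_q.T.astype(np.int32)) * (a_scale * w_scale.T)
ref = acts @ weight.T
print(np.abs(out - ref).max())  # small error vs. the float reference
```

Per-token scales are why this scheme needs no activation calibration data: each token's dynamic range is measured on the fly.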
Unsloth (Recommended)
We strongly recommend Unsloth for quantization and deployment.
NVIDIA ModelOpt
NVIDIA ModelOpt provides advanced quantization optimized for NVIDIA hardware.
Quick Start
```shell
# Install ModelOpt
pip install nvidia-modelopt

# Quantize and export
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# Deploy
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt \
    --port 30000
```
Available Methods
- `modelopt_fp8` (FP8): optimal on NVIDIA Hopper and Blackwell GPUs
- `modelopt_fp4` (FP4): optimal on NVIDIA Blackwell GPUs
Python API
```python
import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model with quantization
model_config = ModelConfig(
    model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization="modelopt_fp8",
    trust_remote_code=True,
)
load_config = LoadConfig(
    modelopt_export_path="./exported_model",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # Optional
)
device_config = DeviceConfig(device="cuda")

# Load and quantize
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)
```
Pre-Quantized Models
Load existing pre-quantized ModelOpt models:
```shell
# FP8 model
python -m sglang.launch_server \
    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --quantization modelopt_fp8

# FP4 model
python -m sglang.launch_server \
    --model-path nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt_fp4
```
auto-round
auto-round supports multiple quantization formats and handles both LLMs and VLMs.
LLM Quantization
```python
from auto_round import AutoRound

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-autoround-4bit"

# Schemes: W2A16, W3A16, W4A16, W8A16, NVFP4, MXFP4, GGUF:Q4_K_M, etc.
scheme = "W4A16"
format = "auto_round"

autoround = AutoRound(model_id, scheme=scheme)
autoround.quantize_and_save(quant_path, format=format)
```
VLM Quantization
```python
from auto_round import AutoRoundMLLM

model_name = "Qwen/Qwen2-VL-2B-Instruct"
quant_path = "Qwen2-VL-2B-Instruct-autoround-4bit"

autoround = AutoRoundMLLM(model_name, scheme="W4A16")
autoround.quantize_and_save(quant_path, format="auto_round")
```
Command Line
```shell
auto-round \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --bits 4 \
    --group_size 128 \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```
GPTQModel
```shell
pip install gptqmodel --no-build-isolation -v
```

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

# Load calibration dataset
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

# Configure and quantize
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
```
LLM Compressor
LLM Compressor, from the vLLM project, supports FP8 and other formats.
```shell
pip install llmcompressor
```

```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load model
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure FP8 quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
Deploy:
```shell
python -m sglang.launch_server \
    --model-path ./Meta-Llama-3-8B-Instruct-FP8-Dynamic
```
Online Quantization
Quantize weights dynamically at server startup.
FP8 Online
```shell
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000
```
TorchAO Quantization
SGLang supports torchao quantization methods:
```shell
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --torchao-config int4wo-128 \
    --port 30000
```
Supported Methods
- `int8dq`: INT8 dynamic quantization (⚠️ disable CUDA graph with `--disable-cuda-graph`)
- `int8wo`: INT8 weight-only
- `fp8wo`: FP8 weight-only
- `fp8dq-per_tensor`: FP8 dynamic, per-tensor scales
- `fp8dq-per_row`: FP8 dynamic, per-row scales
- `int4wo-32`, `int4wo-64`, `int4wo-128`, `int4wo-256`: INT4 weight-only with group sizes 32/64/128/256
`int8dq` has issues with CUDA graph capture. Always use `--disable-cuda-graph` with this method.
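The int4wo group size is an accuracy/overhead knob: each contiguous group of weights shares one scale, so smaller groups track local dynamic range more closely at the cost of more scale metadata. A toy numpy sketch of the trade-off (illustrative only, not the torchao kernels):

```python
import numpy as np

# Toy group-wise INT4 weight-only quantization (not torchao's kernels).
def int4_groupwise_error(w, group_size):
    """Mean absolute reconstruction error for symmetric INT4 with one scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # INT4 range: [-7, 7]
    q = np.clip(np.round(groups / scale), -7, 7)
    dq = q * scale
    return float(np.abs(groups - dq).mean())

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

# Smaller groups adapt to local ranges, so mean error shrinks as group size drops.
for g in (32, 64, 128, 256):
    print(g, int4_groupwise_error(w, g))
```

This is why `int4wo-32` is typically the most accurate and `int4wo-256` the most compact of the variants above.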
AMD GPU Quantization
For AMD GPUs (CDNA3/CDNA4), use quark_int4fp8_moe to quantize MoE layers:
```shell
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --quantization quark_int4fp8_moe \
    --port 30000
```
This quantizes:
- MoE layers: weights to INT4, upcast to FP8 for compute
- Other layers: weights to FP8 directly
Pre-Quantized Model Sources
- **Unsloth**: high-quality quantized models
- **NVIDIA ModelOpt**: NVIDIA-optimized models
- **NeuralMagic**: sparse and quantized models
Always validate quantized models with benchmarks before deployment to guard against quality degradation.
Memory Reduction
| Precision | Memory vs FP16 | Typical Use Case |
|---|---|---|
| FP16/BF16 | 1.0× (baseline) | Full precision |
| FP8 | 0.5× | Hopper/Blackwell GPUs |
| INT8 | 0.5× | Broad compatibility |
| FP4/INT4 | 0.25× | Maximum compression |
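These factors translate directly into bytes per parameter. For example, weight-only sizing for a hypothetical 8B-parameter model (ignoring KV cache, activations, and runtime overhead):

```python
# Weight memory for a hypothetical 8B-parameter model (weights only;
# KV cache, activations, and framework overhead not included).
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "FP4/INT4": 0.5}

params = 8e9
for precision, nbytes in BYTES_PER_PARAM.items():
    gib = params * nbytes / 2**30
    print(f"{precision:10s} {gib:6.1f} GiB")
```

The memory freed by quantization is what enables the larger batch sizes noted below.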
Throughput Improvements
Quantization typically provides:
- 1.5-2× throughput with FP8/INT8
- 2-3× throughput with FP4/INT4
- Lower latency from reduced memory bandwidth pressure
- Larger batch sizes from memory savings
Known Limitations
- Not fully supported due to vLLM's layer fusion (e.g., QKV fusion); different bit-widths within fused layers can cause compatibility issues.
- May encounter issues due to kernel limitations; try skipping problematic layers such as `mlp.gate`.
- Limited support; some format combinations may fail. The AWQ format typically works best.