Deploy Qwen models with a production-ready API server that’s compatible with OpenAI’s API format. This allows you to use existing OpenAI client libraries and tools with Qwen.

Quick Start

Step 1: Install Dependencies

Install the required packages:
pip install fastapi uvicorn "openai<1.0.0" sse_starlette "pydantic<=1.10.13"

Step 2: Download the API Script

The openai_api.py script is included in the Qwen repository:
git clone https://github.com/QwenLM/Qwen.git
cd Qwen

Step 3: Launch the Server

Start the API server with default settings:
python openai_api.py -c Qwen/Qwen-7B-Chat
The server will start on http://127.0.0.1:8000 by default. Visit http://localhost:8000/docs for interactive API documentation.

Configuration Options

Command Line Arguments

python openai_api.py \
  --checkpoint-path Qwen/Qwen-7B-Chat \
  --server-port 8000 \
  --server-name 0.0.0.0 \
  --cpu-only \
  --disable-gc \
  --api-auth username:password
--checkpoint-path (string, default: "Qwen/Qwen-7B-Chat")
Model checkpoint name or path. Can be:
  • HuggingFace model name: Qwen/Qwen-7B-Chat
  • Local path: /path/to/model

--server-port (int, default: 8000)
Port to run the API server on.

--server-name (string, default: "127.0.0.1")
Server bind address:
  • 127.0.0.1: local access only
  • 0.0.0.0: accept connections from any network interface

--cpu-only (flag, default: off)
Run the model on CPU only (not recommended for production).

--disable-gc (flag, default: off)
Disable garbage collection after each response (improves latency at the cost of higher memory usage).

--api-auth (string)
Enable basic HTTP authentication in the format username:password.

API Usage

Using OpenAI Python Client

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Not required unless auth is enabled

# Non-streaming request
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # Model name is ignored, uses loaded model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=2048
)

print(response.choices[0].message.content)

Using cURL

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'
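
With "stream": true in the request body (or stream=True in the Python client), the server returns a server-sent-event stream rather than a single JSON object. A minimal parsing sketch, assuming the standard OpenAI-style "data: {...}" framing terminated by "data: [DONE]"; the sample lines below are illustrative, not captured from a live server:

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from an OpenAI-style SSE response body.

    Each event line looks like 'data: {...}'; the stream ends with
    'data: [DONE]'.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Illustrative captured stream; in practice, iterate over the HTTP response lines
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_content(sample)))  # prints: Hello!
```

The openai Python client performs this parsing for you; a parser like this is only needed when consuming the stream over raw HTTP.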

Request Parameters

model (string, required)
Model identifier (ignored by the server, which uses the loaded model).

messages (array, required)
Array of message objects with role and content fields.

temperature (float, default: 1.0)
Sampling temperature (0.0 to 2.0). Lower values make output more focused and deterministic.

top_p (float, default: 1.0)
Nucleus sampling parameter; an alternative to temperature.

top_k (int)
Top-k sampling parameter. Limits token selection to the k most likely options.

max_length (int)
Maximum total sequence length (prompt + completion).

stream (boolean, default: false)
Enable streaming responses.

stop (array)
Array of stop sequences that halt generation.

functions (array)
Array of function definitions for function calling.
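
The functions parameter follows the OpenAI function-calling schema. A sketch of a request body with a single hypothetical get_current_weather definition (the function name and its parameters are illustrative, not part of the server); POST it to /v1/chat/completions like any other request:

```python
import json

# Hypothetical function schema for illustration
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    "functions": [
        {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                # JSON Schema describing the function's arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

If the model decides to call the function, the response's message carries a function_call field instead of plain content, which your client code is expected to execute and feed back as a follow-up message.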

Production Deployment

Using Gunicorn

For production deployments with multiple workers:
gunicorn openai_api:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout 300 \
  --access-logfile access.log \
  --error-logfile error.log
Multiple workers require multiple GPUs or CPU-only deployment. Each worker loads a full model instance.

Using Systemd Service

Create a systemd service file /etc/systemd/system/qwen-api.service:
[Unit]
Description=Qwen OpenAI API Server
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python openai_api.py -c /models/Qwen-7B-Chat --server-name 0.0.0.0 --server-port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable qwen-api
sudo systemctl start qwen-api
sudo systemctl status qwen-api

Behind Nginx Reverse Proxy

Nginx configuration for SSL termination and load balancing:
upstream qwen_api {
    server 127.0.0.1:8000;
    # Add more backend servers for load balancing
    # server 127.0.0.1:8001;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate /etc/ssl/certs/api.example.com.crt;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;

    location / {
        proxy_pass http://qwen_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # For streaming responses
        proxy_buffering off;
        proxy_cache off;
        
        # Timeouts
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
}

Authentication

Basic HTTP Authentication

Enable authentication when starting the server:
python openai_api.py \
  -c Qwen/Qwen-7B-Chat \
  --api-auth admin:secret_password
Client usage:
import openai
import base64

openai.api_base = "http://localhost:8000/v1"
# Set the authorization header
credentials = base64.b64encode(b"admin:secret_password").decode()
openai.api_key = credentials
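
Without the OpenAI client, the same credentials go in a standard Basic Authorization header. A standard-library sketch that builds the request; the urlopen call is left commented out since it needs a running server:

```python
import base64
import json
import urllib.request

# Encode the username:password pair for Basic auth
credentials = base64.b64encode(b"admin:secret_password").decode()

body = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={
        "Authorization": "Basic " + credentials,
        "Content-Type": "application/json",
    },
    method="POST",
)
print(req.get_header("Authorization"))
# response = urllib.request.urlopen(req)  # requires the server to be running
```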

Custom Authentication

For OAuth2, JWT, or custom authentication, modify the openai_api.py script to add middleware:
from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    token = credentials.credentials
    # Add your token verification logic here
    if not verify_jwt_token(token):
        raise HTTPException(status_code=401, detail="Invalid token")
    return token

# Add the dependency to protected endpoints
@app.post('/v1/chat/completions', dependencies=[Depends(verify_token)])
async def create_chat_completion(request: ChatCompletionRequest):
    # ... existing code

Monitoring

Health Check Endpoint

Add a health check endpoint to your deployment:
@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "Qwen-7B-Chat"}

Logging

Enable detailed logging:
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('qwen_api.log'),
        logging.StreamHandler()
    ]
)

Performance Tips

  • Use --disable-gc for lower latency at the cost of higher memory usage
  • Enable KV cache quantization in the model config
  • Use quantized models (Int4/Int8) for reduced VRAM requirements
  • Use vLLM-based deployment for high-concurrency scenarios
  • Enable Flash Attention 2 in the model
  • Consider multi-GPU deployment with tensor parallelism
  • Use smaller models when possible
  • Implement request batching
  • Use bfloat16 precision instead of float32
  • Pre-load model at startup

Troubleshooting

Error: RuntimeError: CUDA out of memory
Solution: Use a smaller model or a quantized version:
python openai_api.py -c Qwen/Qwen-7B-Chat-Int4

Issue: API responses are slow
Solutions:
  • Install Flash Attention 2
  • Use GPU instead of CPU
  • Reduce the max_length parameter
  • Consider vLLM deployment for better performance

Error: Cannot connect to API server
Solution: Ensure the server is bound to the correct interface:
python openai_api.py --server-name 0.0.0.0 --server-port 8000
Also check firewall rules and network configuration.

Next Steps

Docker Deployment

Deploy with Docker for easier management

vLLM Deployment

Use vLLM for high-performance production inference

Production Guide

Best practices for production deployments

API Reference

Complete API documentation
