Quick Start
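A minimal launch command, assuming Qwen's openai_api.py script; the checkpoint name and flag names are assumptions, so confirm the exact options with python openai_api.py --help:

```shell
# Checkpoint and flag names are assumptions; verify with --help.
python openai_api.py -c Qwen/Qwen-7B-Chat --server-port 8000
```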
The server will start on http://127.0.0.1:8000 by default. Visit http://localhost:8000/docs for interactive API documentation.

Configuration Options
Command Line Arguments
Model checkpoint name or path. Can be:
- HuggingFace model name: Qwen/Qwen-7B-Chat
- Local path: /path/to/model
Port to run the API server on
Server bind address:
- 127.0.0.1: Local access only
- 0.0.0.0: Accept connections from any network interface
Run the model on CPU only (not recommended for production)
Disable garbage collection after each response (improves latency but increases memory usage)
Enable basic HTTP authentication in the format username:password

API Usage
Using OpenAI Python Client
Using cURL
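An equivalent request with curl; the endpoint path follows the OpenAI chat completions API:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7
      }'
```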
Request Parameters
- model: Model identifier (ignored by the server, which uses the loaded model)
- messages: Array of message objects with role and content fields
- temperature: Sampling temperature (0.0 to 2.0). Lower values make output more focused and deterministic
- top_p: Nucleus sampling parameter. Alternative to temperature
- top_k: Top-k sampling parameter. Limits token selection to the top k options
- max_length: Maximum total sequence length (prompt + completion)
- stream: Enable streaming responses
- stop: Array of stop sequences to halt generation
- functions: Array of function definitions for function calling
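Putting the parameters together, a request body might look like this (values are illustrative):

```python
import json

# Illustrative request body combining the parameters above.
payload = {
    "model": "Qwen",  # ignored by the server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about the sea."},
    ],
    "temperature": 0.7,   # lower = more deterministic
    "top_p": 0.9,         # nucleus sampling
    "max_length": 1024,   # prompt + completion tokens
    "stream": False,
    "stop": ["\n\n"],
}
body = json.dumps(payload)
```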
Production Deployment
Using Gunicorn
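A sketch, assuming openai_api.py exposes its FastAPI instance as app; note that every worker process loads its own copy of the model, so size the worker count to available VRAM/RAM:

```shell
gunicorn openai_api:app \
  --workers 2 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 127.0.0.1:8000 \
  --timeout 300
```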
For production deployments, Gunicorn can run the server with multiple workers.

Using Systemd Service
Create a systemd service file /etc/systemd/system/qwen-api.service:
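A sketch of the unit file; the user, paths, and flags are placeholders for your environment:

```ini
[Unit]
Description=Qwen OpenAI-compatible API server
After=network.target

[Service]
User=qwen
WorkingDirectory=/opt/qwen
ExecStart=/usr/bin/python3 openai_api.py --server-port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable and start it with systemctl enable --now qwen-api.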
Behind Nginx Reverse Proxy
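A sketch of a server block; the domain, certificate paths, and upstream ports are placeholders:

```nginx
upstream qwen_api {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/certs/api.example.com.pem;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;

    location / {
        proxy_pass http://qwen_api;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;      # required for streaming responses
        proxy_read_timeout 300s;  # long generations need a generous timeout
    }
}
```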
Nginx can provide SSL termination and load balancing in front of the API server.

Authentication
Basic HTTP Authentication
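A sketch of enabling it at startup; the flag name here is an assumption, so check python openai_api.py --help for the exact option:

```shell
# Flag name is an assumption; credentials use the username:password format.
python openai_api.py --server-port 8000 --api-auth username:password
```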
Enable authentication when starting the server.

Custom Authentication
For OAuth2, JWT, or custom authentication schemes, modify the openai_api.py script to add middleware.
Monitoring
Health Check Endpoint
Add a health check endpoint to your deployment so load balancers and monitors can probe the server.

Logging
Enable detailed logging to trace request handling and generation.

Performance Tips
Optimize Memory Usage
- Use --disable-gc for lower latency at the cost of higher memory usage
- Enable KV cache quantization in the model config
- Use quantized models (Int4/Int8) for reduced VRAM requirements
Improve Throughput
- Use vLLM-based deployment for high-concurrency scenarios
- Enable Flash Attention 2 in the model
- Consider multi-GPU deployment with tensor parallelism
Reduce Latency
- Use smaller models when possible
- Implement request batching
- Use bfloat16 precision instead of float32
- Pre-load model at startup
Troubleshooting
Server fails to start
Error: RuntimeError: CUDA out of memory
Solution: Use a smaller model or a quantized (Int4/Int8) version.
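For example, launching an Int4 checkpoint (the checkpoint name and flag are assumptions; confirm with --help):

```shell
python openai_api.py -c Qwen/Qwen-7B-Chat-Int4
```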
Slow response times
Issue: API responses are slow
Solutions:
- Install Flash Attention 2
- Use GPU instead of CPU
- Reduce max_length parameter
- Consider vLLM deployment for better performance
Connection refused
Error: Cannot connect to the API server
Solution: Ensure the server is bound to the correct interface, and check firewall rules and network configuration.
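For example, to accept connections from other machines, bind to all interfaces (the --server-name flag is an assumption; confirm with --help):

```shell
python openai_api.py --server-name 0.0.0.0 --server-port 8000
```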
Next Steps
- Docker Deployment: Deploy with Docker for easier management
- vLLM Deployment: Use vLLM for high-performance production inference
- Production Guide: Best practices for production deployments
- API Reference: Complete API documentation