Quick Start
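A minimal launch command, assuming Qwen's openai_api.py script; the checkpoint name and flag names are assumptions, so confirm the exact options with python openai_api.py --help:

```shell
# Checkpoint and flag names are assumptions; verify with --help.
python openai_api.py -c Qwen/Qwen-7B-Chat --server-port 8000
```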
The server will start on http://127.0.0.1:8000 by default. Visit http://localhost:8000/docs for interactive API documentation.

Configuration Options
Command Line Arguments
Model checkpoint name or path. Can be:
- HuggingFace model name: Qwen/Qwen-7B-Chat
- Local path: /path/to/model
Port to run the API server on
Server bind address:
- 127.0.0.1: Local access only
- 0.0.0.0: Accept connections from any network interface
Run the model on CPU only (not recommended for production)
Disable garbage collection after each response (improves latency but increases memory usage)
Enable basic HTTP authentication in the format username:password

API Usage
Using OpenAI Python Client
Using cURL
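An equivalent request with curl; the endpoint path follows the OpenAI chat completions API:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7
      }'
```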
Request Parameters
- model: Model identifier (ignored by the server, which uses the loaded model)
- messages: Array of message objects with role and content fields
- temperature: Sampling temperature (0.0 to 2.0). Lower values make output more focused and deterministic
- top_p: Nucleus sampling parameter. Alternative to temperature
- top_k: Top-k sampling parameter. Limits token selection to the top k options
- max_length: Maximum total sequence length (prompt + completion)
- stream: Enable streaming responses
- stop: Array of stop sequences to halt generation
- functions: Array of function definitions for function calling
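Putting the parameters together, a request body might look like this (values are illustrative):

```python
import json

# Illustrative request body combining the parameters above.
payload = {
    "model": "Qwen",  # ignored by the server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about the sea."},
    ],
    "temperature": 0.7,   # lower = more deterministic
    "top_p": 0.9,         # nucleus sampling
    "max_length": 1024,   # prompt + completion tokens
    "stream": False,
    "stop": ["\n\n"],
}
body = json.dumps(payload)
```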
Production Deployment
Using Gunicorn
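A sketch, assuming openai_api.py exposes its FastAPI instance as app; note that every worker process loads its own copy of the model, so size the worker count to available VRAM/RAM:

```shell
gunicorn openai_api:app \
  --workers 2 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 127.0.0.1:8000 \
  --timeout 300
```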
For production deployments, Gunicorn can run the server with multiple workers.

Using Systemd Service
Create a systemd service file /etc/systemd/system/qwen-api.service:
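A sketch of the unit file; the user, paths, and flags are placeholders for your environment:

```ini
[Unit]
Description=Qwen OpenAI-compatible API server
After=network.target

[Service]
User=qwen
WorkingDirectory=/opt/qwen
ExecStart=/usr/bin/python3 openai_api.py --server-port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable and start it with systemctl enable --now qwen-api.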
Behind Nginx Reverse Proxy
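A sketch of a server block; the domain, certificate paths, and upstream ports are placeholders:

```nginx
upstream qwen_api {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/certs/api.example.com.pem;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;

    location / {
        proxy_pass http://qwen_api;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;      # required for streaming responses
        proxy_read_timeout 300s;  # long generations need a generous timeout
    }
}
```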
Nginx can provide SSL termination and load balancing in front of the API server.

Authentication
Basic HTTP Authentication
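A sketch of enabling it at startup; the flag name here is an assumption, so check python openai_api.py --help for the exact option:

```shell
# Flag name is an assumption; credentials use the username:password format.
python openai_api.py --server-port 8000 --api-auth username:password
```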
Enable authentication when starting the server.

Custom Authentication
For OAuth2, JWT, or custom authentication schemes, modify the openai_api.py script to add middleware.
Monitoring
Health Check Endpoint
Add a health check endpoint to your deployment so load balancers and monitors can probe the server.

Logging
Enable detailed logging to trace request handling and generation.

Performance Tips
Optimize Memory Usage
- Use --disable-gc for lower latency at the cost of higher memory usage
- Enable KV cache quantization in the model config
- Use quantized models (Int4/Int8) for reduced VRAM requirements
Improve Throughput
- Use vLLM-based deployment for high-concurrency scenarios
- Enable Flash Attention 2 in the model
- Consider multi-GPU deployment with tensor parallelism
Reduce Latency
- Use smaller models when possible
- Implement request batching
- Use bfloat16 precision instead of float32
- Pre-load model at startup
Troubleshooting
Server fails to start
Error: RuntimeError: CUDA out of memory
Solution: Use a smaller model or a quantized (Int4/Int8) version.
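For example, launching an Int4 checkpoint (the checkpoint name and flag are assumptions; confirm with --help):

```shell
python openai_api.py -c Qwen/Qwen-7B-Chat-Int4
```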
Slow response times
Issue: API responses are slow
Solutions:
- Install Flash Attention 2
- Use GPU instead of CPU
- Reduce max_length parameter
- Consider vLLM deployment for better performance
Connection refused
Error: Cannot connect to the API server
Solution: Ensure the server is bound to the correct interface, and check firewall rules and network configuration.
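For example, to accept connections from other machines, bind to all interfaces (the --server-name flag is an assumption; confirm with --help):

```shell
python openai_api.py --server-name 0.0.0.0 --server-port 8000
```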
Next Steps
- Docker Deployment: Deploy with Docker for easier management
- vLLM Deployment: Use vLLM for high-performance production inference
- Production Guide: Best practices for production deployments
- API Reference: Complete API documentation