Qwen3-TTS Documentation

A Python SDK for text-to-speech generation with voice cloning, voice design, and ultra-low-latency streaming, supporting 10 major languages.

Quick Start

Get up and running with Qwen3-TTS in minutes

1

Install the package

Install Qwen3-TTS using pip in a fresh Python environment:

pip install -U qwen-tts

We recommend using Python 3.12 in a clean conda or virtual environment to avoid dependency conflicts.
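To confirm the environment matches the recommendation above, a quick check with the standard library can help. This is a minimal sketch: `check_environment` is a hypothetical helper, and it assumes the importable module is `qwen_tts` and the distribution name is `qwen-tts`, as suggested by the install command and the import used in the next step.

```python
import sys
from importlib import metadata, util


def check_environment(module_name="qwen_tts", dist_name="qwen-tts"):
    """Report the Python version and whether the package is installed.

    check_environment is a hypothetical helper, not part of the SDK.
    """
    info = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "importable": util.find_spec(module_name) is not None,
        "version": None,
    }
    try:
        info["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass  # The distribution is not installed in this environment.
    return info


if __name__ == "__main__":
    print(check_environment())
```

If `importable` is false after installation, the interpreter running the script is probably not the one pip installed into.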

2

Load a model

Import the SDK and load a Qwen3-TTS model:

import torch
from qwen_tts import Qwen3TTSModel

# Load the 1.7B CustomVoice checkpoint onto the first GPU in bfloat16.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
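The call above hard-codes `cuda:0` and `bfloat16`. On machines without a suitable GPU, one option is to choose these values at runtime. The sketch below is illustrative only: `pick_device_and_dtype` and `detect` are hypothetical helpers, and whether a given checkpoint runs acceptably on CPU or in `float16`/`float32` is an assumption this page does not confirm.

```python
def pick_device_and_dtype(cuda_available, bf16_supported):
    """Return (device_map, dtype_name) to pass to from_pretrained.

    Hypothetical helper: prefers GPU with bfloat16, then GPU with
    float16, and falls back to CPU with float32.
    """
    if cuda_available and bf16_supported:
        return "cuda:0", "bfloat16"
    if cuda_available:
        return "cuda:0", "float16"
    return "cpu", "float32"


def detect():
    """Query torch if it is installed; otherwise assume CPU only."""
    try:
        import torch
    except ImportError:
        return pick_device_and_dtype(False, False)
    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return pick_device_and_dtype(cuda, bf16)
```

With torch imported, the result could be used as `device_map, dtype_name = detect()` followed by `dtype=getattr(torch, dtype_name)` in the `from_pretrained` call.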

3

Generate speech

Generate natural-sounding speech from text:

import soundfile as sf

# Returns the generated waveforms and their sample rate.
wavs, sr = model.generate_custom_voice(
    text="Hello, welcome to Qwen3-TTS!",
    language="English",
    speaker="Ryan",
)

# Write the first waveform to a WAV file.
sf.write("output.wav", wavs[0], sr)

The model weights are downloaded automatically from Hugging Face on first use. For offline environments, see the Installation guide.
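If `soundfile` is not available, the standard-library `wave` module can write the result instead. This sketch assumes `wavs[0]` is a one-dimensional sequence of float samples in the range [-1, 1] and that `sr` is an integer sample rate; neither is stated on this page, so check the values you actually get back. `write_wav_16bit` is a hypothetical helper.

```python
import struct
import wave


def write_wav_16bit(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM; return seconds."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # Guard against overshoot.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(int(sample_rate))
        f.writeframes(bytes(frames))
    # Two bytes per sample, so duration = samples / sample_rate.
    return len(frames) / 2 / sample_rate
```

Under those assumptions, `write_wav_16bit("output.wav", wavs[0], sr)` would stand in for the `sf.write` call, and the returned duration is a cheap sanity check that audio was produced.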

Explore by Feature

Discover what you can build with Qwen3-TTS

Custom Voice

Generate speech with 9 premium preset voices covering multiple languages and dialects

Voice Design

Create custom voices from natural-language descriptions with instruction-based control

Voice Cloning

Clone a voice from a reference audio sample as short as 3 seconds

Streaming

Ultra-low-latency streaming, with end-to-end synthesis latency as low as 97 ms for real-time interaction

Batch Processing

Process multiple text inputs efficiently with batched inference

Fine-tuning

Fine-tune models for your specific use case
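For the Batch Processing feature listed above, one way to handle a long list of inputs is to split it into fixed-size chunks and make one call per chunk. The sketch below assumes that `generate_custom_voice` accepts lists for `text`, `language`, and `speaker` and returns one waveform per input; this page does not confirm that signature, so treat it as an assumption. `synthesize_in_batches` and `batch_size` are hypothetical names, and the generation function is passed in so the helper can be tried with a stand-in.

```python
def synthesize_in_batches(generate, texts, language, speaker, batch_size=4):
    """Call `generate` once per chunk of `texts` and collect the waveforms.

    `generate` is expected to behave like model.generate_custom_voice with
    list arguments (an assumption) and to return (wavs, sample_rate).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    all_wavs, sample_rate = [], None
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        wavs, sample_rate = generate(
            text=chunk,
            language=[language] * len(chunk),
            speaker=[speaker] * len(chunk),
        )
        all_wavs.extend(wavs)  # Preserve the input order across chunks.
    return all_wavs, sample_rate
```

A call might look like `synthesize_in_batches(model.generate_custom_voice, texts, "English", "Ryan")`. The right `batch_size` depends on available GPU memory and text length.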

Key Features

What makes Qwen3-TTS powerful

10 Languages

Support for Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian with multilingual and cross-lingual capabilities.
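Because the Quick Start passes the language as an English name (`language="English"`), it can be useful to validate user input against the list above before calling the model. This is a sketch under that assumption: `SUPPORTED_LANGUAGES` is copied from the list above, `normalize_language` is a hypothetical helper, and whether the SDK performs its own validation or accepts other spellings is not stated here.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def normalize_language(name):
    """Return the canonical spelling of a supported language name.

    Matching ignores case and surrounding whitespace; anything else raises.
    """
    wanted = name.strip().lower()
    for language in SUPPORTED_LANGUAGES:
        if language.lower() == wanted:
            return language
    raise ValueError(
        f"Unsupported language {name!r}; expected one of {SUPPORTED_LANGUAGES}"
    )
```

Failing early with a clear message keeps a typo such as `"Englsih"` from reaching the model call.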

Ultra-Low Latency

Achieve end-to-end synthesis latency as low as 97 ms with streaming generation, suited to real-time interactive applications.
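Latency figures depend on hardware and configuration, so it is worth measuring on your own setup. The streaming API is not shown on this page, so the sketch below does not call it; it times any Python iterator that yields audio chunks. `time_to_first_chunk` is a hypothetical helper.

```python
import time


def time_to_first_chunk(chunks):
    """Return (first_chunk, seconds_waited) for an iterator of audio chunks.

    The clock starts when this function is called, so build the iterator
    immediately before calling it. Raises StopIteration if it is empty.
    """
    start = time.perf_counter()
    first = next(iter(chunks))  # Blocks until the first chunk is produced.
    return first, time.perf_counter() - start
```

Measuring time to the first chunk, rather than time to the full utterance, is what matters for interactive use: playback can begin as soon as that chunk arrives.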

High Quality

Powered by Qwen3-TTS-Tokenizer-12Hz for efficient acoustic compression and high-fidelity speech reconstruction.

Instruction Control

Control voice characteristics, emotion, and prosody using natural-language instructions for expressive speech output.
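As a rough illustration of instruction control, the Quick Start call could be wrapped so that a plain-language style description travels with the text. This sketch assumes `generate_custom_voice` accepts an `instruct` keyword; that parameter name is not confirmed by this page, so verify it against the API reference. `speak_with_style` is a hypothetical helper.

```python
def speak_with_style(model, text, style, language="English", speaker="Ryan"):
    """Generate speech for `text`, steering delivery with a plain-language style.

    Assumes model.generate_custom_voice accepts an `instruct` keyword;
    that parameter name is an assumption, not confirmed by this page.
    """
    return model.generate_custom_voice(
        text=text,
        language=language,
        speaker=speaker,
        instruct=style,
    )
```

For example, `speak_with_style(model, "The results are in.", "Speak slowly, in a calm and reassuring tone.")` would return the same `(wavs, sr)` pair as the Quick Start call if the assumption holds.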

Ready to get started?

Install Qwen3-TTS and generate high-quality speech in minutes

Get Started Now
