Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.
When to Fine-tune
Fine-tuning is recommended when you need:
- Consistent custom voice: Train the model to generate a specific voice character consistently
- Domain adaptation: Improve performance on specialized vocabulary or speaking styles
- Voice quality: Enhance the naturalness and consistency for a particular speaker
- Single-speaker applications: Apps or services that use one primary voice throughout
Prerequisites
Before starting, ensure you have:
- Installed the qwen-tts package (for example, via `pip install qwen-tts`)
- Cloned the repository
Dataset Preparation
Input Format
Prepare your training data as a JSONL file (one JSON object per line). Each line must contain three fields:
- audio: Path to the target training audio (WAV format)
- text: Transcript corresponding to the audio
- ref_audio: Path to the reference speaker audio (WAV format)
Example JSONL
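The snippet below writes a minimal three-sample dataset and sanity-checks it; all audio paths are hypothetical placeholders, and the same ref_audio is reused on every line:

```shell
# Write a minimal example dataset (paths are hypothetical placeholders).
cat > train_raw.jsonl <<'EOF'
{"audio": "data/sample_001.wav", "text": "Hello, this is the first training sample.", "ref_audio": "data/ref_speaker.wav"}
{"audio": "data/sample_002.wav", "text": "Fine-tuning adapts the model to one voice.", "ref_audio": "data/ref_speaker.wav"}
{"audio": "data/sample_003.wav", "text": "Each line is a standalone JSON object.", "ref_audio": "data/ref_speaker.wav"}
EOF

# Sanity-check that every line parses and carries the three required fields.
python3 - <<'EOF'
import json

with open("train_raw.jsonl") as f:
    for line in f:
        row = json.loads(line)
        assert {"audio", "text", "ref_audio"} <= row.keys()
print("OK")
EOF
```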
Reference Audio Recommendations
Keeping ref_audio identical across the dataset usually improves:
- Speaker consistency during generation
- Training stability
- Voice quality in the fine-tuned model
The reference audio should:
- Be 3-10 seconds long
- Contain clear speech from the target speaker
- Have minimal background noise
- Be in WAV format at 24kHz sampling rate
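If a recording does not meet these requirements, a tool such as ffmpeg (assumed to be installed; the filenames below are placeholders) can resample it to mono 24kHz WAV:

```shell
# Resample a source recording to mono, 24 kHz, 16-bit PCM WAV.
# Input and output filenames are placeholders.
ffmpeg -i source_recording.wav -ac 1 -ar 24000 -sample_fmt s16 ref_speaker.wav
```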
Training Process
Step 1: Extract Audio Codes
Convert your raw JSONL into a training-ready format that includes audio codes. The extraction script accepts:
- --device: GPU device to use (e.g., cuda:0)
- --tokenizer_model_path: Path or model ID of the tokenizer
- --input_jsonl: Path to your raw JSONL file
- --output_jsonl: Output path for processed JSONL with audio codes
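A sketch of the extraction command; the script name `extract_codes.py`, the tokenizer placeholder, and the file paths are all hypothetical, so substitute the actual names from the repository:

```shell
# Hypothetical invocation: replace the script name and the
# tokenizer placeholder with the actual ones from the repository.
python extract_codes.py \
  --device cuda:0 \
  --tokenizer_model_path <tokenizer-model-or-path> \
  --input_jsonl train_raw.jsonl \
  --output_jsonl train_with_codes.jsonl
```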
Step 2: Run Fine-tuning
Launch the supervised fine-tuning (SFT) process:

| Parameter | Description | Recommended Value |
|---|---|---|
| --init_model_path | Base model to fine-tune | Qwen/Qwen3-TTS-12Hz-1.7B-Base or Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| --output_model_path | Directory to save checkpoints | output |
| --train_jsonl | Training data with codes | Path from Step 1 |
| --batch_size | Batch size per GPU | 2-32 (depends on GPU memory) |
| --lr | Learning rate | 2e-6 to 2e-5 |
| --num_epochs | Number of training epochs | 3-10 |
| --speaker_name | Name for the custom speaker | Any string (e.g., speaker_test) |
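Putting the parameters together, a typical launch might look like the sketch below; the processed JSONL filename is a placeholder, and the batch size and learning rate should be tuned to your GPU:

```shell
# Example SFT launch (sketch); adjust batch size and learning rate
# to your hardware, and point --train_jsonl at the Step 1 output.
python sft_12hz.py \
  --init_model_path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --output_model_path output \
  --train_jsonl train_with_codes.jsonl \
  --batch_size 16 \
  --lr 2e-6 \
  --num_epochs 5 \
  --speaker_name speaker_test
```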
Step 3: Test Your Model
Quickly test the fine-tuned model by running inference with your new speaker name.

Configuration Options
Hardware Requirements
- GPU Memory: 16GB+ recommended for 1.7B model with batch_size=2
- RAM: 32GB+ recommended
- GPU: NVIDIA GPU with CUDA support, FlashAttention 2 compatible
Training Configuration
The sft_12hz.py script uses the following configuration:
- Mixed precision: bfloat16 for memory efficiency
- Gradient accumulation: 4 steps
- Optimizer: AdamW with weight decay 0.01
- Gradient clipping: Max norm 1.0
- Loss function: Combined main codec loss + 0.3 × sub-talker loss
Data Configuration
In dataset.py, the training pipeline:
- Loads audio at 24kHz sampling rate
- Extracts mel-spectrograms (128 mels, 1024 FFT size)
- Tokenizes text with Qwen3-TTS processor
- Creates dual-channel inputs (text + codec channels)
- Applies speaker embeddings from reference audio
Best Practices
Data Collection
- Quality over quantity: 30-100 high-quality samples often work better than 1000+ low-quality samples
- Diverse content: Include varied sentences covering different phonemes and prosody patterns
- Clean audio: Use professional recordings or well-processed audio with minimal noise
- Consistent environment: Record all samples in similar acoustic conditions
Training Tips
- Start with lower learning rate: 2e-6 is safer; increase to 2e-5 if underfitting
- Monitor loss: Training loss should steadily decrease; if it plateaus early, try:
  - Increasing the learning rate
  - Adding more diverse training data
  - Training for more epochs
- Batch size: Adjust based on GPU memory:
  - 24GB GPU: batch_size 32 for 0.6B, 16 for 1.7B
  - 16GB GPU: batch_size 16 for 0.6B, 2-4 for 1.7B
- Checkpoint selection: Test multiple epoch checkpoints; later isn’t always better
Evaluation
- Listen to outputs: Subjective quality is most important
- Test diverse inputs: Try short/long sentences, different emotions, edge cases
- Compare to base model: Ensure fine-tuning improved quality for your use case
- Check consistency: Generate same sentence multiple times to verify consistency
One-Click Training Script
For convenience, combine all steps into a single shell script, train.sh; make it executable (chmod +x train.sh) and run it with ./train.sh.
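A minimal sketch of such a train.sh, assuming the Step 1 extraction script is named extract_codes.py (hypothetical, as are the tokenizer placeholder and file paths) and using the flags described in the steps above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Step 1: extract audio codes (script name and tokenizer are placeholders).
python extract_codes.py \
  --device cuda:0 \
  --tokenizer_model_path <tokenizer-model-or-path> \
  --input_jsonl train_raw.jsonl \
  --output_jsonl train_with_codes.jsonl

# Step 2: run supervised fine-tuning with the recommended defaults.
python sft_12hz.py \
  --init_model_path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --output_model_path output \
  --train_jsonl train_with_codes.jsonl \
  --batch_size 16 \
  --lr 2e-6 \
  --num_epochs 5 \
  --speaker_name speaker_test
```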
Troubleshooting
Out of Memory (OOM)
- Reduce --batch_size
- Use the 0.6B model instead of 1.7B
- Enable gradient checkpointing (modify sft_12hz.py)
Poor Quality Output
- Ensure reference audio is high quality
- Use the same ref_audio for all training samples
- Train for more epochs (try 10-20)
- Check that audio files are 24kHz WAV format
Training Too Slow
- Increase batch size if GPU memory allows
- Use multiple GPUs (requires modifying training script)
- Ensure FlashAttention 2 is installed
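FlashAttention 2 is distributed on PyPI as flash-attn; installation builds CUDA kernels, so it requires a CUDA toolchain and a matching PyTorch install, and is typically run with build isolation disabled:

```shell
# Requires a CUDA-capable environment with PyTorch already installed.
pip install flash-attn --no-build-isolation
```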
Next Steps
- Learn about vLLM integration for optimized inference
- Explore performance optimization techniques
- Try the DashScope API for cloud deployment