Overview
Voice cloning allows you to create custom voices from audio samples. Upload a recording or record directly in the browser to generate a voice that can be used for text-to-speech generation.
Requirements
Voice creation requires an active subscription. Without an active plan, the API returns SUBSCRIPTION_REQUIRED (403).
Audio Requirements
Audio file containing the voice sample
- Maximum size: 20 MB
- Minimum duration: 10 seconds
- Supported formats: All audio formats (WAV, MP3, M4A, FLAC, etc.)
- Content-Type: Must match actual audio format
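A client can check the size and duration limits above before uploading. The following is a minimal sketch, not the service's own validation: the function names are hypothetical, "20 MB" is assumed to mean 20 MiB, and duration is read only for WAV data using Python's standard `wave` module.

```python
import io
import wave

MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" is interpreted as 20 MiB
MIN_SECONDS = 10.0            # minimum duration stated in the requirements


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_sample(size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of problems; an empty list means the sample passes."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 20 MB")
    if duration_seconds < MIN_SECONDS:
        problems.append("audio shorter than 10 seconds")
    return problems
```

Other formats (MP3, M4A, FLAC) would need a different way to read the duration; the server-side checks remain authoritative either way.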
Voice Metadata
Display name for the voice
- Minimum length: 1 character
- This name will appear in voice selectors
Voice category for organization
Available categories:
- AUDIOBOOK: Audiobook narration
- CONVERSATIONAL: Natural conversation
- CUSTOMER_SERVICE: Support and service
- GENERAL: General purpose (default)
- NARRATIVE: Storytelling
- CHARACTERS: Character voices
- MEDITATION: Calm and soothing
- MOTIVATIONAL: Energetic and inspiring
- PODCAST: Podcast hosting
- ADVERTISING: Marketing and ads
- VOICEOVER: Professional voiceover
- CORPORATE: Business presentations
Language code for the voice
- Format: BCP 47 language tag (e.g., en-US, es-MX, fr-FR)
- Default: en-US
- Supports all locale-codes locales with region (e.g., en-US, not just en)
Optional description of the voice characteristics
- Helps identify and organize voices
- Searchable in voice library
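The metadata rules above can be mirrored in a client-side check. This is a sketch under assumptions: `validate_metadata` is a hypothetical helper, and the regular expression is a simplified stand-in for "language tag with region" that does not cover every BCP 47 form (for example script subtags or numeric regions).

```python
import re

# Category names as listed in this document.
CATEGORIES = {
    "AUDIOBOOK", "CONVERSATIONAL", "CUSTOMER_SERVICE", "GENERAL",
    "NARRATIVE", "CHARACTERS", "MEDITATION", "MOTIVATIONAL",
    "PODCAST", "ADVERTISING", "VOICEOVER", "CORPORATE",
}

# Simplified pattern: language subtag plus two-letter region, e.g. en-US.
LANGUAGE_WITH_REGION = re.compile(r"^[a-z]{2,3}-[A-Z]{2}$")


def validate_metadata(name: str, category: str = "GENERAL",
                      language: str = "en-US") -> list[str]:
    """Return a list of validation errors; empty means the metadata looks valid."""
    errors = []
    if len(name) < 1:
        errors.append("name must be at least 1 character")
    if category not in CATEGORIES:
        errors.append(f"unknown category: {category}")
    if not LANGUAGE_WITH_REGION.match(language):
        errors.append(f"language must include a region: {language}")
    return errors
```

For example, `validate_metadata("", "GENERAL", "en")` reports both the empty name and the missing region, while `validate_metadata("Narrator")` returns an empty list.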
Creation Methods
- Upload File
- Record Audio
Upload an existing audio file from your device.
Choose file
- Click “Upload file” or drag and drop
- File must be under 20 MB
- Preview and playback available after upload
API Workflow
Voice creation uses a REST API endpoint (not tRPC) due to file upload requirements:
Endpoint
Request Format
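As a rough illustration of the request shape, the sketch below builds (but does not send) a request with Python's standard library. Several details are assumptions rather than documented facts: the endpoint URL is supplied by the caller, the POST method, the query parameter names (`name`, `category`, `language`, `description`), and the Bearer authorization scheme are all hypothetical. What this document does state is that metadata is validated as query parameters and that the Content-Type header must match the audio format.

```python
import mimetypes
from urllib.parse import urlencode
from urllib.request import Request


def build_voice_request(endpoint_url, audio_bytes, filename, name,
                        category="GENERAL", language="en-US",
                        description=None, token=None):
    """Build (but do not send) a voice-creation request."""
    # Guess the Content-Type from the file extension; the API requires it
    # to match the actual audio format.
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None or not content_type.startswith("audio/"):
        raise ValueError(f"cannot determine audio Content-Type for {filename!r}")

    # Hypothetical query parameter names; check the real API reference.
    params = {"name": name, "category": category, "language": language}
    if description:
        params["description"] = description

    headers = {"Content-Type": content_type}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed auth scheme

    # The raw audio bytes are sent as the request body.
    return Request(f"{endpoint_url}?{urlencode(params)}", data=audio_bytes,
                   headers=headers, method="POST")
```

The returned object can be passed to `urllib.request.urlopen` once the real endpoint and authentication details are known.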
Validation Steps
The API performs several validation checks:
Response
Success (201):
401 Unauthorized
403 SUBSCRIPTION_REQUIRED
No active subscription found for the organization.
400 Invalid Input
Query parameters failed validation (missing name, invalid category, etc.).
400 Missing Content-Type
Request must include Content-Type header matching audio format.
413 File Too Large
Audio file exceeds the 20 MB size limit.
422 Invalid Audio File
File is not a valid audio file or cannot be parsed.
422 Audio Too Short
Audio duration is less than 10 seconds minimum.
500 Creation Failed
Voice creation or upload failed. Partial data is automatically cleaned up.
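A client might translate these status codes into user-facing hints. The sketch below is illustrative only: the hint text paraphrases the descriptions above, `explain_failure` is a hypothetical helper, and the retry flags are assumptions (only the 500 case is treated as worth retrying, on the basis that partial data is cleaned up automatically).

```python
# Maps HTTP status to (hint, worth_retrying). Retry flags are assumptions.
ERROR_HINTS = {
    401: ("Request was not authenticated.", False),
    403: ("No active subscription found for the organization.", False),
    400: ("Invalid query parameters or missing Content-Type header.", False),
    413: ("Audio file exceeds the 20 MB size limit.", False),
    422: ("File is not valid audio, or is shorter than 10 seconds.", False),
    500: ("Voice creation or upload failed on the server.", True),
}


def explain_failure(status: int) -> tuple[str, bool]:
    """Return a hint and whether retrying the same request may help."""
    return ERROR_HINTS.get(status, (f"Unexpected status {status}.", False))
```

Because 400 and 422 each cover two distinct failures, a real client would also need to inspect the response body to tell them apart; its format is not shown here.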
Storage Structure
Custom voices are stored with the following structure:
- Organization scoped: Each org’s voices are isolated
- Unique IDs: Voice IDs are generated using CUID
- Variant: Custom voices have variant: "CUSTOM"
- Ownership: Linked to creating organization via orgId
Usage Tracking
Voice creation is metered for billing:
Usage tracking is fire-and-forget. Tracking failures won’t block voice creation.
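The fire-and-forget pattern can be sketched as follows. This illustrates the general technique, not the service's actual implementation: `track_usage_fire_and_forget` and the `track` callable are hypothetical, and a background thread stands in for whatever mechanism the service uses.

```python
import logging
import threading


def track_usage_fire_and_forget(track, event):
    """Run track(event) in the background; never raise into the caller."""
    def run():
        try:
            track(event)
        except Exception:
            # A tracking failure is logged and swallowed so that it
            # cannot block or fail the voice creation that triggered it.
            logging.getLogger("usage").warning(
                "usage tracking failed", exc_info=True)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread  # returned only so callers or tests can join if they wish
```

The caller continues immediately after `start()`, so a slow or failing tracker does not delay the response to the user.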
Best Practices
- Audio Quality: Use high-quality audio samples with minimal background noise
- Duration: Longer samples (30+ seconds) typically produce better clones
- Consistency: Single speaker, consistent volume and tone
- Language: Ensure audio matches selected language code
- Categories: Choose appropriate category for better organization