
Overview

The TTS (Text-to-Speech) module provides offline speech synthesis using various model types. Generate complete audio from text, with support for multiple speakers, adjustable speed, and model-specific parameters. Key features:
  • Multiple model types (VITS, Matcha, Kokoro, Kitten, Pocket, Zipvoice)
  • Multi-speaker support (speaker selection by ID)
  • Adjustable speech speed
  • Voice cloning with reference audio (Pocket, Zipvoice)
  • Timestamp generation
  • Save to WAV files or play directly

Quick Start

import { createTTS, saveAudioToFile } from 'react-native-sherpa-onnx/tts';

// Create a TTS engine
const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-vits-piper-en_US-libritts_r-medium',
  },
  modelType: 'auto',  // Auto-detect from files
  numThreads: 2,
});

// Generate speech
const audio = await tts.generateSpeech('Hello, world!');
console.log('Sample rate:', audio.sampleRate);
console.log('Samples:', audio.samples.length);

// Save to file
await saveAudioToFile(audio, '/path/to/output.wav');

// Clean up
await tts.destroy();

Supported Model Types

Model Type | Description                              | Config Options
vits       | VITS models (Piper, Coqui, MeloTTS, MMS) | noiseScale, noiseScaleW, lengthScale
matcha     | Matcha models                            | noiseScale, lengthScale
kokoro     | Kokoro (multi-speaker, multi-language)   | lengthScale
kitten     | KittenTTS (lightweight)                  | lengthScale
pocket     | Pocket TTS (voice cloning)               | voice cloning via referenceAudio
zipvoice   | Zipvoice (voice cloning)                 | voice cloning via referenceAudio
auto       | Auto-detect from files                   | (options of the detected type)
Use modelType: 'auto' to automatically detect the model type. The SDK will choose the correct type based on files in the directory.

API Reference

createTTS(options)

Creates a TTS engine for batch (one-shot) speech generation.
src/tts/index.ts
export async function createTTS(
  options: TTSInitializeOptions | ModelPathConfig
): Promise<TtsEngine>;
Options:
  • modelPath (ModelPathConfig, required): Model directory path. Use { type: 'asset', path: 'models/...' } for bundled assets.
  • modelType (TTSModelType, default: 'auto'): Model type: 'vits', 'matcha', 'kokoro', 'kitten', 'pocket', 'zipvoice', or 'auto'.
  • numThreads (number, default: 2): Number of threads for inference. More threads are usually faster but use more CPU.
  • provider (string, default: 'cpu'): Execution provider (e.g., 'cpu', 'coreml', 'xnnpack'). See Execution Providers.
  • debug (boolean, default: false): Enable debug logging.
  • modelOptions (TtsModelOptions): Model-specific configuration. Only the block for the loaded model type is applied:
      • vits: { noiseScale, noiseScaleW, lengthScale }
      • matcha: { noiseScale, lengthScale }
      • kokoro: { lengthScale }
      • kitten: { lengthScale }
  • ruleFsts (string): Path(s) to rule FSTs for text normalization.
  • ruleFars (string): Path(s) to rule FARs for text normalization.
  • maxNumSentences (number, default: 1): Max sentences per streaming callback.
  • silenceScale (number, default: 0.2): Silence scale at the config level.

TtsEngine: generateSpeech(text, options?)

Generate speech audio from text.
const audio = await tts.generateSpeech(
  'Hello, world!',
  {
    sid: 0,        // Speaker ID
    speed: 1.0,    // Speech speed
  }
);
Returns GeneratedAudio:
interface GeneratedAudio {
  samples: number[];    // Float PCM in [-1.0, 1.0]
  sampleRate: number;   // Sample rate in Hz (e.g., 22050)
}
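Because samples is plain float PCM, derived values such as duration are straightforward to compute. A small sketch, with GeneratedAudio declared locally to mirror the interface above:

```typescript
// Mirrors the GeneratedAudio shape returned by generateSpeech().
interface GeneratedAudio {
  samples: number[]; // float PCM in [-1.0, 1.0]
  sampleRate: number; // Hz
}

// Duration in seconds: sample count divided by samples per second.
function audioDurationSeconds(audio: GeneratedAudio): number {
  return audio.samples.length / audio.sampleRate;
}

// One second of silence at 22050 Hz.
const audio: GeneratedAudio = {
  samples: new Array(22050).fill(0),
  sampleRate: 22050,
};
console.log(audioDurationSeconds(audio)); // 1
```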
Generation Options:
  • sid (number, default: 0): Speaker ID for multi-speaker models. Use getNumSpeakers() to check available speakers.
  • speed (number, default: 1.0): Speech speed multiplier. 1.0 = normal, 0.5 = half speed (slower), 2.0 = double speed (faster).
  • silenceScale (number): Silence scale at generation time (model-dependent).
  • referenceAudio ({ samples: number[], sampleRate: number }): Reference audio for voice cloning (Pocket, Zipvoice). Mono float samples in [-1, 1].
  • referenceText (string): Transcript of the reference audio (required when using referenceAudio).
  • numSteps (number): Flow-matching steps (Pocket TTS).
  • extra (Record<string, string>): Model-specific options (e.g., Pocket: { temperature: '0.7', chunk_size: '15' }).

TtsEngine: generateSpeechWithTimestamps(text, options?)

Generate speech with word-level timestamps.
const result = await tts.generateSpeechWithTimestamps('Hello world');

console.log('Samples:', result.samples);
console.log('Sample rate:', result.sampleRate);
console.log('Subtitles:', result.subtitles);
// [{ text: 'Hello', start: 0, end: 0.5 }, { text: 'world', start: 0.6, end: 1.0 }]
console.log('Estimated:', result.estimated);  // true if timestamps are estimated
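The subtitles entries ({ text, start, end } with times in seconds, as shown above) map naturally onto subtitle formats. A sketch of a SubRip (SRT) converter; the helper names are illustrative, not SDK exports:

```typescript
// One word/segment from generateSpeechWithTimestamps().
interface SubtitleEntry {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Join the entries into an SRT document (index, time range, text).
function toSrt(subtitles: SubtitleEntry[]): string {
  return subtitles
    .map((sub, i) => `${i + 1}\n${srtTime(sub.start)} --> ${srtTime(sub.end)}\n${sub.text}`)
    .join('\n\n');
}
```

Remember to check result.estimated before presenting the timings as exact.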

TtsEngine: updateParams(options)

Update model parameters at runtime without reloading.
await tts.updateParams({
  modelOptions: {
    vits: {
      noiseScale: 0.7,
      lengthScale: 1.2,
    },
  },
});

TtsEngine: getModelInfo()

Get model information (sample rate and number of speakers).
const info = await tts.getModelInfo();
console.log('Sample rate:', info.sampleRate);
console.log('Number of speakers:', info.numSpeakers);

TtsEngine: getSampleRate()

Get the model’s sample rate.
const sampleRate = await tts.getSampleRate();
console.log('Sample rate:', sampleRate);  // e.g., 22050

TtsEngine: getNumSpeakers()

Get the number of available speakers.
const numSpeakers = await tts.getNumSpeakers();
console.log('Speakers:', numSpeakers);  // 0 or 1 = single-speaker, >1 = multi-speaker

TtsEngine: destroy()

Release native resources. Must be called when done.
await tts.destroy();
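Because destroy() must run even if generation throws, a try/finally wrapper is a convenient pattern. This is a sketch; TtsLike and withTts are illustrative names, not SDK exports:

```typescript
// Anything with a destroy() method, standing in for TtsEngine.
interface TtsLike {
  destroy(): Promise<void>;
}

// Run `fn` with the engine, guaranteeing destroy() afterwards.
async function withTts<T extends TtsLike, R>(
  engine: T,
  fn: (tts: T) => Promise<R>
): Promise<R> {
  try {
    return await fn(engine);
  } finally {
    await engine.destroy(); // always release native resources
  }
}
```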

Saving Audio

Save to File

import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const audio = await tts.generateSpeech('Hello, world!');
await saveAudioToFile(audio, '/path/to/output.wav');
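saveAudioToFile handles the WAV encoding for you. If you instead need the WAV bytes in memory (for example, to upload), a 16-bit PCM WAV can be assembled by hand. This is a self-contained sketch, not part of the SDK:

```typescript
interface GeneratedAudio {
  samples: number[]; // float PCM in [-1, 1]
  sampleRate: number;
}

// Encode mono float PCM as a 16-bit PCM WAV file in an ArrayBuffer.
function encodeWav(audio: GeneratedAudio): ArrayBuffer {
  const n = audio.samples.length;
  const buffer = new ArrayBuffer(44 + n * 2); // 44-byte header + 16-bit samples
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + n * 2, true);            // RIFF chunk size
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);                   // fmt chunk size
  view.setUint16(20, 1, true);                    // audio format: PCM
  view.setUint16(22, 1, true);                    // channels: mono
  view.setUint32(24, audio.sampleRate, true);     // sample rate
  view.setUint32(28, audio.sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);                    // block align
  view.setUint16(34, 16, true);                   // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, n * 2, true);                // data chunk size

  for (let i = 0; i < n; i++) {
    // Clamp to [-1, 1] and scale to signed 16-bit.
    const s = Math.max(-1, Math.min(1, audio.samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```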

Android: Save via SAF (Storage Access Framework)

import { saveAudioToContentUri } from 'react-native-sherpa-onnx/tts';

// User selects directory via SAF
const directoryUri = 'content://...';

const audio = await tts.generateSpeech('Hello, world!');
const fileUri = await saveAudioToContentUri(
  audio,
  directoryUri,
  'output.wav'
);

console.log('Saved to:', fileUri);

Share Audio File

import { shareAudioFile } from 'react-native-sherpa-onnx/tts';

// Save first
const filePath = '/path/to/output.wav';
await saveAudioToFile(audio, filePath);

// Share
await shareAudioFile(filePath, 'audio/wav');

Model-Specific Configuration

VITS Models

VITS models support three tuning parameters:
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'vits',
  modelOptions: {
    vits: {
      noiseScale: 0.667,    // Controls randomness (0.0-1.0)
      noiseScaleW: 0.8,     // Duration randomness (0.0-1.0)
      lengthScale: 1.0,     // Speech speed (0.5=slower, 2.0=faster)
    },
  },
});

Matcha Models

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
  modelOptions: {
    matcha: {
      noiseScale: 0.667,
      lengthScale: 1.0,
    },
  },
});

Kokoro Models

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro-en' },
  modelType: 'kokoro',
  modelOptions: {
    kokoro: {
      lengthScale: 1.2,  // Only lengthScale for Kokoro
    },
  },
});

Voice Cloning

Pocket and Zipvoice models support voice cloning via reference audio.

Pocket TTS (Voice Cloning)

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/pocket-tts' },
  modelType: 'pocket',
});

// Load reference audio (3-10 seconds recommended)
const refAudio = loadReferenceAudio();  // Your audio loading function

const audio = await tts.generateSpeech(
  'This is the target text to speak.',
  {
    referenceAudio: {
      samples: refAudio.samples,  // Float32 mono [-1, 1]
      sampleRate: 22050,
    },
    referenceText: 'This is what the reference recording says.',
    numSteps: 20,      // Flow-matching steps (higher = better quality, slower)
    speed: 1.0,
    extra: {
      temperature: '0.7',   // Randomness (0.0-1.0)
      chunk_size: '15',     // Chunk size for generation
    },
  }
);

Zipvoice (Voice Cloning)

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/zipvoice-zh-en' },
  modelType: 'zipvoice',
});

const audio = await tts.generateSpeech(
  'Target speech text.',
  {
    referenceAudio: {
      samples: refAudioSamples,
      sampleRate: 24000,
    },
    referenceText: 'Transcript of reference.',
    numSteps: 20,
    speed: 1.0,
  }
);
Zipvoice streaming with voice cloning is not supported. Use generateSpeech() (batch mode) for voice cloning with Zipvoice. For Pocket TTS, both batch and streaming modes support voice cloning.
Zipvoice Memory Requirements: The full fp32 Zipvoice model (~605 MB) requires significant RAM. On devices with less than 8 GB of RAM, use the int8 distill variant (sherpa-onnx-zipvoice-distill-int8-zh-en-emilia, ~104 MB) to avoid crashes. The SDK checks free memory before loading and rejects initialization if less than ~800 MB is available.
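The 3-10 second reference-audio recommendation can be checked before calling generateSpeech(). A small validation sketch; the bounds come from the guidance on this page, not from an SDK limit, and validateReferenceAudio is an illustrative helper:

```typescript
interface ReferenceAudio {
  samples: number[]; // mono float PCM in [-1, 1]
  sampleRate: number;
}

// Check that the reference clip falls inside the recommended 3-10 s window.
function validateReferenceAudio(ref: ReferenceAudio): { ok: boolean; seconds: number } {
  const seconds = ref.samples.length / ref.sampleRate;
  return { ok: seconds >= 3 && seconds <= 10, seconds };
}
```

Calling this up front lets you warn the user about an out-of-range clip instead of silently producing a poor clone.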

Multi-Speaker Models

Some models include multiple speakers (voices).
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-multi-speaker' },
});

// Check available speakers
const numSpeakers = await tts.getNumSpeakers();
console.log('Available speakers:', numSpeakers);

// Generate with different speakers
for (let sid = 0; sid < numSpeakers; sid++) {
  const audio = await tts.generateSpeech('Hello from speaker ' + sid, { sid });
  await saveAudioToFile(audio, `/path/speaker_${sid}.wav`);
}

await tts.destroy();

Model Detection

Detect TTS model type without initializing:
import { detectTtsModel } from 'react-native-sherpa-onnx/tts';

const result = await detectTtsModel(
  { type: 'asset', path: 'models/vits-piper-en' }
);

if (result.success) {
  console.log('Detected type:', result.modelType);
  console.log('Models:', result.detectedModels);
  
  if (result.modelType === 'vits' || result.modelType === 'matcha') {
    // Show noise/length scale options in UI
  }
}

Performance Optimization

Threading

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  numThreads: 4,  // More threads are usually faster, with diminishing returns beyond the core count
});

Hardware Acceleration

import { getCoreMlSupport } from 'react-native-sherpa-onnx';

// iOS: Check Core ML support
const coremlSupport = await getCoreMlSupport();
if (coremlSupport.hasAccelerator) {
  const tts = await createTTS({
    modelPath: { type: 'asset', path: 'models/vits-piper-en' },
    provider: 'coreml',  // Use Apple Neural Engine
  });
}

Speed Control

Adjust speech speed at generation time:
// Slower speech
const slowAudio = await tts.generateSpeech('Hello', { speed: 0.75 });

// Faster speech
const fastAudio = await tts.generateSpeech('Hello', { speed: 1.5 });

Common Use Cases

Generate and Play

import Sound from 'react-native-sound';
import RNFS from 'react-native-fs'; // provides a writable cache directory path

const audio = await tts.generateSpeech('Hello, world!');

// Save to a temporary file
const tempPath = `${RNFS.CachesDirectoryPath}/temp.wav`;
await saveAudioToFile(audio, tempPath);

// Play it
const sound = new Sound(tempPath, '', (error) => {
  if (!error) {
    sound.play();
  }
});

Batch Generation

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
});

const phrases = [
  'Good morning',
  'How are you?',
  'Have a nice day',
];

for (const [index, phrase] of phrases.entries()) {
  const audio = await tts.generateSpeech(phrase);
  await saveAudioToFile(audio, `/path/phrase_${index}.wav`);
}

await tts.destroy();

Dynamic Speaker Selection

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro-multi' },
});

const numSpeakers = await tts.getNumSpeakers();

// User selects speaker
const selectedSpeaker = 2;

if (selectedSpeaker < numSpeakers) {
  const audio = await tts.generateSpeech(
    'User-entered text',
    { sid: selectedSpeaker }
  );
}

Troubleshooting

Model fails to load
  • Verify the model directory exists and contains the required files
  • VITS: model.onnx, tokens.txt, and espeak-ng-data (some models)
  • Zipvoice: encoder, decoder, vocoder, tokens, lexicon, espeak-ng-data
  • Try modelType: 'auto' for automatic detection
  • Enable debug: true for detailed logs

Out of memory (Zipvoice)
The full Zipvoice model (~605 MB) requires significant RAM:
  • Use the int8 distill variant: sherpa-onnx-zipvoice-distill-int8-zh-en-emilia (~104 MB)
  • Close other apps to free memory
  • Target devices with 8+ GB of RAM for the full model

Poor audio quality
  • Adjust noiseScale (VITS/Matcha): try 0.667-1.0
  • Adjust lengthScale: values close to 1.0 sound most natural
  • Try a larger or higher-quality model
  • Increase numSteps for flow-matching models (Pocket)

Speech too fast or too slow
Use the speed parameter at generation time:
const audio = await tts.generateSpeech('Text', { speed: 0.8 });  // Slower
Or adjust lengthScale in the model options, which applies to every generation.

Voice cloning produces poor results
  • Ensure the model supports voice cloning (Pocket, Zipvoice)
  • Reference audio should be 3-10 seconds, clear, and mono
  • Provide an accurate referenceText transcript
  • For Zipvoice, use generateSpeech(), not streaming
  • Increase numSteps for better quality

Next Steps

Streaming TTS

Low-latency streaming generation

Model Setup

Learn how to bundle and load models

Speech-to-Text

Transcribe audio to text

Execution Providers

Hardware acceleration options
