
@qvac/tts-onnx

Text-to-speech (TTS) speech synthesis for QVAC.

Overview

A Bare module that adds text-to-speech support to QVAC, using ONNX Runtime as the inference engine.

Models

You can load any Chatterbox model bundle compatible with ONNX Runtime. Required files: tokenizer (*.json) + speech encoder, embed tokens, conditional decoder, and language model (*.onnx).

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/tts-onnx

Quickstart

If you don't have the Bare runtime installed, install it globally:

npm i -g bare

Create a new project:

mkdir qvac-tts-quickstart
cd qvac-tts-quickstart
npm init -y

Install dependencies:

npm i @qvac/tts-onnx bare-fs bare-path

Place the Chatterbox model files into models/chatterbox/: tokenizer.json, speech_encoder.onnx, embed_tokens.onnx, conditional_decoder.onnx, language_model.onnx. Also place a reference WAV file (for voice cloning) at ./reference.wav.

Create index.js:

'use strict'

const fs = require('bare-fs')
const path = require('bare-path')
const ONNXTTS = require('@qvac/tts-onnx')
const { setLogger, releaseLogger } = require('@qvac/tts-onnx/addonLogging')

const CHATTERBOX_SAMPLE_RATE = 24000

const tokenizerPath = 'models/chatterbox/tokenizer.json'
const speechEncoderPath = 'models/chatterbox/speech_encoder.onnx'
const embedTokensPath = 'models/chatterbox/embed_tokens.onnx'
const conditionalDecoderPath = 'models/chatterbox/conditional_decoder.onnx'
const languageModelPath = 'models/chatterbox/language_model.onnx'

const refWavPath = path.resolve('./reference.wav')

async function main () {
  setLogger((priority, message) => {
    const priorityNames = {
      0: 'ERROR',
      1: 'WARNING',
      2: 'INFO',
      3: 'DEBUG',
      4: 'OFF'
    }
    const priorityName = priorityNames[priority] || 'UNKNOWN'
    const timestamp = new Date().toISOString()
    console.log(`[${timestamp}] [C++ log] [${priorityName}]: ${message}`)
  })

  // Load reference audio (16-bit PCM WAV)
  const wavBuf = fs.readFileSync(refWavPath)
  const dataOffset = 44 // standard WAV header size (assumes a canonical header with no extra chunks)
  const int16 = new Int16Array(wavBuf.buffer, wavBuf.byteOffset + dataOffset, (wavBuf.length - dataOffset) / 2)
  const referenceAudio = new Float32Array(int16.length)
  for (let i = 0; i < int16.length; i++) referenceAudio[i] = int16[i] / 32768

  // Chatterbox configuration
  const chatterboxArgs = {
    tokenizerPath,
    speechEncoderPath,
    embedTokensPath,
    conditionalDecoderPath,
    languageModelPath,
    referenceAudio,
    opts: { stats: true },
    logger: console
  }

  const config = {
    language: 'en'
  }

  const model = new ONNXTTS(chatterboxArgs, config)

  try {
    console.log('Loading Chatterbox TTS model...')
    await model.load()
    console.log('Model loaded.')

    const textToSynthesize = 'Hello world! This is a test of the Chatterbox TTS system.'
    console.log(`Running TTS on: "${textToSynthesize}"`)

    const response = await model.run({
      input: textToSynthesize,
      type: 'text'
    })

    console.log('Waiting for TTS results...')
    let buffer = []

    await response
      .onUpdate(data => {
        if (data && data.outputArray) {
          buffer = buffer.concat(Array.from(data.outputArray))
        }
      })
      .await()

    console.log('TTS finished!')
    if (response.stats) {
      console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
    }

    console.log(`Generated ${buffer.length} audio samples at ${CHATTERBOX_SAMPLE_RATE}Hz`)
  } catch (err) {
    console.error('Error during TTS processing:', err)
  } finally {
    console.log('Unloading model...')
    await model.unload()
    console.log('Model unloaded.')
    releaseLogger()
  }
}

main().catch(console.error)

Run index.js:

bare index.js

Usage

1. Import the Model Class

const ONNXTTS = require('@qvac/tts-onnx')

2. Create a Data Loader

Data Loaders abstract how model files are accessed. We recommend a HyperdriveDataLoader, which streams the model file(s) from a Hyperdrive. Alternatively, a FileSystemDataLoader streams the model file(s) from your local file system.

const store = new Corestore('./store')
const hdStore = store.namespace('hd')

// see examples folder for existing keys
const hdDL = new HyperDriveDL({
  key: 'hd://your-hyperdrive-key-here',
  store: hdStore
})

3. Create the args obj

const args = {
  loader: hdDL,
  opts: { stats: true },
  logger: console,
  cache: './models/',
  tokenizerPath: 'chatterbox/tokenizer.json',
  speechEncoderPath: 'chatterbox/speech_encoder.onnx',
  embedTokensPath: 'chatterbox/embed_tokens.onnx',
  conditionalDecoderPath: 'chatterbox/conditional_decoder.onnx',
  languageModelPath: 'chatterbox/language_model.onnx',
  referenceAudio: referenceAudioFloat32Array
}

The args obj contains the following properties:

  • loader: The Data Loader instance from which the model files will be streamed.
  • logger: Logger instance (e.g. console) that the module uses for log output.
  • opts.stats: This flag determines whether to calculate inference stats.
  • cache: The local directory where the model files will be downloaded to.
  • tokenizerPath: Path to the Chatterbox tokenizer JSON file.
  • speechEncoderPath: Path to the speech encoder ONNX model.
  • embedTokensPath: Path to the embed tokens ONNX model.
  • conditionalDecoderPath: Path to the conditional decoder ONNX model.
  • languageModelPath: Path to the language model ONNX model.
  • referenceAudio: Float32Array of reference audio samples for voice cloning.

4. Create the config obj

The config obj consists of a set of parameters which can be used to tweak the behaviour of the TTS model.

const config = {
  language: 'en',
  useGPU: true,
}

Parameter  Type     Default  Description
language   string   'en'     Language code (ISO 639-1 format)
useGPU     boolean  false    Enable GPU acceleration based on EP provider

5. Create Model Instance

const model = new ONNXTTS(args, config)

6. Load Model

await model.load()

Optionally you can pass the following parameters to tweak the loading behaviour.

  • closeLoader?: Boolean determining whether to close the Data Loader after loading. Defaults to true.
  • reportProgressCallback?: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

Property             Type    Description
action               string  Current operation being performed
totalSize            number  Total bytes to be loaded
totalFiles           number  Total number of files to process
filesProcessed       number  Number of files completed so far
currentFile          string  Name of file currently being processed
currentFileProgress  string  Percentage progress on current file
overallProgress      string  Overall loading progress percentage
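As an example, a progress callback could compose these fields into a single status line. This is a minimal sketch (the helper name `formatProgress` is illustrative, not part of the library; the field names come from the table above):

```javascript
// Sketch of a progress callback using the documented progress fields.
function formatProgress (progress) {
  return `${progress.action}: ` +
    `${progress.filesProcessed}/${progress.totalFiles} files, ` +
    `${progress.currentFile} at ${progress.currentFileProgress}%, ` +
    `overall ${progress.overallProgress}%`
}

// Pass it to load(), e.g.:
// await model.load(true, p => process.stdout.write(`\r${formatProgress(p)}`))
```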

7. Run TTS Synthesis

Pass the text to synthesize to the run method. Process the generated audio output asynchronously:

try {
  const textToSynthesize = 'Hello world! This is a test of the TTS system.'
  let audioSamples = []

  const response = await model.run({
    input: textToSynthesize,
    type: 'text'
  })

  // Process output using callback to collect audio samples
  await response
    .onUpdate(data => {
      if (data.outputArray) {
        // Collect raw PCM audio samples
        const samples = Array.from(data.outputArray)
        audioSamples = audioSamples.concat(samples)
        console.log(`Received ${samples.length} audio samples`)
      }
      if (data.event === 'JobEnded') {
        console.log('TTS synthesis completed:', data.stats)
      }
    })
    .await() // Wait for the entire process to complete

  console.log(`Total audio samples generated: ${audioSamples.length}`)
  
  // audioSamples now contains the complete audio as PCM data (16-bit, 24kHz, mono)
  // You can create WAV files, stream to audio APIs, etc.

  // Access performance stats if enabled
  if (response.stats) {
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  }

} catch (error) {
  console.error('TTS synthesis failed:', error)
}

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  // Close P2P resources if applicable
} catch (error) {
  console.error('Failed to unload model:', error)
}
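If the model was loaded with `closeLoader: false` (step 6) on top of a Corestore-backed data loader (step 2), you are responsible for closing the store yourself. A sketch, assuming the Corestore `store` from step 2 (`close()` is standard Corestore teardown, but check your data loader for any additional cleanup; `teardown` is a hypothetical helper):

```javascript
// Tear down the model and, if applicable, the Corestore behind the loader.
async function teardown (model, store) {
  try {
    await model.unload()
  } finally {
    // Only needed when the Data Loader was kept open (closeLoader: false).
    if (store) await store.close()
  }
}
```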

Output Format

The output is received via the onUpdate callback of the response object. The TTS system provides raw audio data in the form of PCM samples.

Output Events

The system generates different types of events during TTS synthesis:

1. Audio Output Events

When audio data is available, the callback receives raw PCM samples:

// Audio output event - contains only the raw PCM data
{
  outputArray: Int16Array([1234, -567, 890, -123, ...]) // 16-bit PCM samples
}

2. Job Completion Events

When synthesis completes, performance statistics are provided:

// Job completion event - contains performance statistics
{
  totalTime: 0.624621926,              // Total processing time in seconds
  tokensPerSecond: 219.33267837286903, // Processing speed
  realTimeFactor: 0.05818013468703428, // Real-time factor; values below 1 mean synthesis is faster than real time, so streaming is possible
  audioDurationMs: 10736,              // Generated audio duration in milliseconds
  totalSamples: 171776                 // Total number of audio samples generated
}
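The real-time factor is simply the processing time divided by the duration of the generated audio, so a value below 1 means synthesis runs faster than playback. A quick check against the example numbers (the helper name `realTimeFactor` here is illustrative, not a library export):

```javascript
// realTimeFactor = total processing time / generated audio duration.
function realTimeFactor (totalTimeSec, audioDurationMs) {
  return totalTimeSec / (audioDurationMs / 1000)
}
```

With the figures above, 0.6246 s of processing for 10.736 s of audio gives a factor of about 0.058, i.e. roughly 17x faster than real time.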

Audio Format Specifications:

  • Sample Rate: 24000 Hz
  • Format: 16-bit signed PCM, mono channel
  • Data Type: Int16Array containing raw audio samples
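To save the synthesis output as a playable file, you can wrap the collected samples in a WAV container matching these specifications. The sketch below builds a minimal 44-byte RIFF/WAVE header for 16-bit mono PCM at 24000 Hz; `encodeWav` is a hypothetical helper, not part of @qvac/tts-onnx, and it assumes the global Buffer (available in both Node and Bare):

```javascript
// Minimal WAV encoder for the 16-bit mono PCM output (hypothetical helper,
// not part of @qvac/tts-onnx).
function encodeWav (samples, sampleRate = 24000) {
  const dataSize = samples.length * 2 // 16-bit = 2 bytes per sample
  const buf = Buffer.alloc(44 + dataSize)
  buf.write('RIFF', 0)
  buf.writeUInt32LE(36 + dataSize, 4) // RIFF chunk size
  buf.write('WAVE', 8)
  buf.write('fmt ', 12)
  buf.writeUInt32LE(16, 16)           // fmt chunk size
  buf.writeUInt16LE(1, 20)            // audio format: PCM
  buf.writeUInt16LE(1, 22)            // channels: mono
  buf.writeUInt32LE(sampleRate, 24)   // sample rate
  buf.writeUInt32LE(sampleRate * 2, 28) // byte rate
  buf.writeUInt16LE(2, 32)            // block align
  buf.writeUInt16LE(16, 34)           // bits per sample
  buf.write('data', 36)
  buf.writeUInt32LE(dataSize, 40)
  for (let i = 0; i < samples.length; i++) {
    buf.writeInt16LE(samples[i], 44 + i * 2)
  }
  return buf
}
```

You could then write the result with bare-fs, e.g. `fs.writeFileSync('output.wav', encodeWav(audioSamples))`.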

Working with Audio Data

Here's how to collect and process the audio output:

let audioSamples = []

const response = await model.run({
  input: 'Your text to synthesize',
  type: 'text'
})

await response
  .onUpdate(data => {
    if (data.outputArray) {
      // Check if this is an audio output event
      const samples = Array.from(data.outputArray)
      audioSamples = audioSamples.concat(samples)
      console.log(`Received ${samples.length} audio samples`)
    } else {
      // This is a completion event with statistics
      console.log('TTS completed with stats:', data)
    }
  })
  .await()

// audioSamples now contains all PCM samples as 16-bit integers
// Sample rate: 24000 Hz, Format: mono PCM
console.log(`Total audio samples generated: ${audioSamples.length}`)
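If your playback path expects floating-point audio (e.g. a Web Audio-style API), convert the 16-bit samples to Float32 in the range [-1, 1), mirroring the reference-audio conversion in the quickstart (`int16ToFloat32` is an illustrative helper, not a library export):

```javascript
// Convert 16-bit PCM samples to Float32 in [-1, 1) for float-based audio APIs.
function int16ToFloat32 (samples) {
  const out = new Float32Array(samples.length)
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] / 32768
  return out
}
```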

More resources

Package on npm
