
@qvac/tts-onnx

Text-to-speech (TTS) speech synthesis for QVAC.

Overview

A Bare module that adds text-to-speech support to QVAC, using ONNX Runtime as the inference engine.

Models

You can load any Chatterbox model bundle compatible with ONNX Runtime. Required files: tokenizer (*.json) + speech encoder, embed tokens, conditional decoder, and language model (*.onnx).

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/tts-onnx

Quickstart

If you don't have the Bare runtime installed, install it globally:

npm i -g bare

Create a new project:

mkdir qvac-tts-quickstart
cd qvac-tts-quickstart
npm init -y

Install dependencies:

npm i @qvac/tts-onnx bare-fs bare-path

Place the Chatterbox model files into models/chatterbox/: tokenizer.json, speech_encoder.onnx, embed_tokens.onnx, conditional_decoder.onnx, language_model.onnx. Also place a reference WAV file (for voice cloning) at ./reference.wav.

Create index.js:

'use strict'

const fs = require('bare-fs')
const path = require('bare-path')
const ONNXTTS = require('@qvac/tts-onnx')
const { setLogger, releaseLogger } = require('@qvac/tts-onnx/addonLogging')

const CHATTERBOX_SAMPLE_RATE = 24000

const tokenizerPath = 'models/chatterbox/tokenizer.json'
const speechEncoderPath = 'models/chatterbox/speech_encoder.onnx'
const embedTokensPath = 'models/chatterbox/embed_tokens.onnx'
const conditionalDecoderPath = 'models/chatterbox/conditional_decoder.onnx'
const languageModelPath = 'models/chatterbox/language_model.onnx'

const refWavPath = path.resolve('./reference.wav')

async function main () {
  setLogger((priority, message) => {
    const priorityNames = {
      0: 'ERROR',
      1: 'WARNING',
      2: 'INFO',
      3: 'DEBUG',
      4: 'OFF'
    }
    const priorityName = priorityNames[priority] || 'UNKNOWN'
    const timestamp = new Date().toISOString()
    console.log(`[${timestamp}] [C++ log] [${priorityName}]: ${message}`)
  })

  // Load reference audio (16-bit PCM WAV)
  const wavBuf = fs.readFileSync(refWavPath)
  const dataOffset = 44 // standard WAV header size (assumes a canonical header with no extra chunks)
  const int16 = new Int16Array(wavBuf.buffer, wavBuf.byteOffset + dataOffset, (wavBuf.length - dataOffset) / 2)
  const referenceAudio = new Float32Array(int16.length)
  for (let i = 0; i < int16.length; i++) referenceAudio[i] = int16[i] / 32768

  // Chatterbox configuration
  const chatterboxArgs = {
    tokenizerPath,
    speechEncoderPath,
    embedTokensPath,
    conditionalDecoderPath,
    languageModelPath,
    referenceAudio,
    opts: { stats: true },
    logger: console
  }

  const config = {
    language: 'en'
  }

  const model = new ONNXTTS(chatterboxArgs, config)

  try {
    console.log('Loading Chatterbox TTS model...')
    await model.load()
    console.log('Model loaded.')

    const textToSynthesize = 'Hello world! This is a test of the Chatterbox TTS system.'
    console.log(`Running TTS on: "${textToSynthesize}"`)

    const response = await model.run({
      input: textToSynthesize,
      type: 'text'
    })

    console.log('Waiting for TTS results...')
    let buffer = []

    await response
      .onUpdate(data => {
        if (data && data.outputArray) {
          buffer = buffer.concat(Array.from(data.outputArray))
        }
      })
      .await()

    console.log('TTS finished!')
    if (response.stats) {
      console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
    }

    console.log(`Generated ${buffer.length} audio samples at ${CHATTERBOX_SAMPLE_RATE}Hz`)
  } catch (err) {
    console.error('Error during TTS processing:', err)
  } finally {
    console.log('Unloading model...')
    await model.unload()
    console.log('Model unloaded.')
    releaseLogger()
  }
}

main().catch(console.error)

Run index.js:

bare index.js

Usage

1. Import the Model Class

const ONNXTTS = require('@qvac/tts-onnx')

2. Create a Data Loader

Data Loaders abstract how model files are accessed. We recommend a HyperdriveDataLoader, which streams the model file(s) from a Hyperdrive. Alternatively, a FileSystemDataLoader streams the model file(s) from your local file system.

const store = new Corestore('./store')
const hdStore = store.namespace('hd')

// see examples folder for existing keys
const hdDL = new HyperDriveDL({
  key: 'hd://your-hyperdrive-key-here',
  store: hdStore
})

3. Create the args obj

const args = {
  loader: hdDL,
  opts: { stats: true },
  logger: console,
  cache: './models/',
  tokenizerPath: 'chatterbox/tokenizer.json',
  speechEncoderPath: 'chatterbox/speech_encoder.onnx',
  embedTokensPath: 'chatterbox/embed_tokens.onnx',
  conditionalDecoderPath: 'chatterbox/conditional_decoder.onnx',
  languageModelPath: 'chatterbox/language_model.onnx',
  referenceAudio: referenceAudioFloat32Array
}

The args obj contains the following properties:

  • loader: The Data Loader instance from which the model files will be streamed.
  • logger: Logger instance (e.g. console) that the module uses for log output.
  • opts.stats: This flag determines whether to calculate inference stats.
  • cache: The local directory where the model files will be downloaded to.
  • tokenizerPath: Path to the Chatterbox tokenizer JSON file.
  • speechEncoderPath: Path to the speech encoder ONNX model.
  • embedTokensPath: Path to the embed tokens ONNX model.
  • conditionalDecoderPath: Path to the conditional decoder ONNX model.
  • languageModelPath: Path to the language model ONNX model.
  • referenceAudio: Float32Array of reference audio samples for voice cloning.

4. Create the config obj

The config obj consists of a set of parameters which can be used to tweak the behaviour of the TTS model.

const config = {
  language: 'en',
  useGPU: true,
}

Parameter  Type     Default  Description
language   string   'en'     Language code (ISO 639-1 format)
useGPU     boolean  false    Enable GPU acceleration based on EP provider

5. Create Model Instance

const model = new ONNXTTS(args, config)

6. Load Model

await model.load()

Optionally you can pass the following parameters to tweak the loading behaviour.

  • closeLoader?: Boolean determining whether to close the Data Loader after loading. Defaults to true.
  • reportProgressCallback?: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

Property             Type    Description
action               string  Current operation being performed
totalSize            number  Total bytes to be loaded
totalFiles           number  Total number of files to process
filesProcessed       number  Number of files completed so far
currentFile          string  Name of file currently being processed
currentFileProgress  string  Percentage progress on current file
overallProgress      string  Overall loading progress percentage
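As an example, a progress callback could compose these fields into a single status line. This is a minimal sketch (the helper name `formatProgress` is illustrative, not part of the library; the field names come from the table above):

```javascript
// Sketch of a progress callback using the documented progress fields.
function formatProgress (progress) {
  return `${progress.action}: ` +
    `${progress.filesProcessed}/${progress.totalFiles} files, ` +
    `${progress.currentFile} at ${progress.currentFileProgress}%, ` +
    `overall ${progress.overallProgress}%`
}

// Pass it to load(), e.g.:
// await model.load(true, p => process.stdout.write(`\r${formatProgress(p)}`))
```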

7. Run TTS Synthesis

Pass the text to synthesize to the run method. Process the generated audio output asynchronously:

try {
  const textToSynthesize = 'Hello world! This is a test of the TTS system.'
  let audioSamples = []

  const response = await model.run({
    input: textToSynthesize,
    type: 'text'
  })

  // Process output using callback to collect audio samples
  await response
    .onUpdate(data => {
      if (data.outputArray) {
        // Collect raw PCM audio samples
        const samples = Array.from(data.outputArray)
        audioSamples = audioSamples.concat(samples)
        console.log(`Received ${samples.length} audio samples`)
      }
      if (data.event === 'JobEnded') {
        console.log('TTS synthesis completed:', data.stats)
      }
    })
    .await() // Wait for the entire process to complete

  console.log(`Total audio samples generated: ${audioSamples.length}`)
  
  // audioSamples now contains the complete audio as PCM data (16-bit, 24kHz, mono)
  // You can create WAV files, stream to audio APIs, etc.

  // Access performance stats if enabled
  if (response.stats) {
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  }

} catch (error) {
  console.error('TTS synthesis failed:', error)
}

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  // Close P2P resources if applicable
} catch (error) {
  console.error('Failed to unload model:', error)
}
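If the model was loaded with `closeLoader: false` (step 6) on top of a Corestore-backed data loader (step 2), you are responsible for closing the store yourself. A sketch, assuming the Corestore `store` from step 2 (`close()` is standard Corestore teardown, but check your data loader for any additional cleanup; `teardown` is a hypothetical helper):

```javascript
// Tear down the model and, if applicable, the Corestore behind the loader.
async function teardown (model, store) {
  try {
    await model.unload()
  } finally {
    // Only needed when the Data Loader was kept open (closeLoader: false).
    if (store) await store.close()
  }
}
```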

Output Format

The output is received via the onUpdate callback of the response object. The TTS system provides raw audio data in the form of PCM samples.

Output Events

The system generates different types of events during TTS synthesis:

1. Audio Output Events

When audio data is available, the callback receives raw PCM samples:

// Audio output event - contains only the raw PCM data
{
  outputArray: Int16Array([1234, -567, 890, -123, ...]) // 16-bit PCM samples
}

2. Job Completion Events

When synthesis completes, performance statistics are provided:

// Job completion event - contains performance statistics
{
  totalTime: 0.624621926,              // Total processing time in seconds
  tokensPerSecond: 219.33267837286903, // Processing speed
  realTimeFactor: 0.05818013468703428, // Real-time factor; values below 1 mean synthesis is faster than real time, so streaming is possible
  audioDurationMs: 10736,              // Generated audio duration in milliseconds
  totalSamples: 171776                 // Total number of audio samples generated
}
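The real-time factor is simply the processing time divided by the duration of the generated audio, so a value below 1 means synthesis runs faster than playback. A quick check against the example numbers (the helper name `realTimeFactor` here is illustrative, not a library export):

```javascript
// realTimeFactor = total processing time / generated audio duration.
function realTimeFactor (totalTimeSec, audioDurationMs) {
  return totalTimeSec / (audioDurationMs / 1000)
}
```

With the figures above, 0.6246 s of processing for 10.736 s of audio gives a factor of about 0.058, i.e. roughly 17x faster than real time.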

Audio Format Specifications:

  • Sample Rate: 24000 Hz
  • Format: 16-bit signed PCM, mono channel
  • Data Type: Int16Array containing raw audio samples
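To save the synthesis output as a playable file, you can wrap the collected samples in a WAV container matching these specifications. The sketch below builds a minimal 44-byte RIFF/WAVE header for 16-bit mono PCM at 24000 Hz; `encodeWav` is a hypothetical helper, not part of @qvac/tts-onnx, and it assumes the global Buffer (available in both Node and Bare):

```javascript
// Minimal WAV encoder for the 16-bit mono PCM output (hypothetical helper,
// not part of @qvac/tts-onnx).
function encodeWav (samples, sampleRate = 24000) {
  const dataSize = samples.length * 2 // 16-bit = 2 bytes per sample
  const buf = Buffer.alloc(44 + dataSize)
  buf.write('RIFF', 0)
  buf.writeUInt32LE(36 + dataSize, 4) // RIFF chunk size
  buf.write('WAVE', 8)
  buf.write('fmt ', 12)
  buf.writeUInt32LE(16, 16)           // fmt chunk size
  buf.writeUInt16LE(1, 20)            // audio format: PCM
  buf.writeUInt16LE(1, 22)            // channels: mono
  buf.writeUInt32LE(sampleRate, 24)   // sample rate
  buf.writeUInt32LE(sampleRate * 2, 28) // byte rate
  buf.writeUInt16LE(2, 32)            // block align
  buf.writeUInt16LE(16, 34)           // bits per sample
  buf.write('data', 36)
  buf.writeUInt32LE(dataSize, 40)
  for (let i = 0; i < samples.length; i++) {
    buf.writeInt16LE(samples[i], 44 + i * 2)
  }
  return buf
}
```

You could then write the result with bare-fs, e.g. `fs.writeFileSync('output.wav', encodeWav(audioSamples))`.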

Working with Audio Data

Here's how to collect and process the audio output:

let audioSamples = []

const response = await model.run({
  input: 'Your text to synthesize',
  type: 'text'
})

await response
  .onUpdate(data => {
    if (data.outputArray) {
      // Check if this is an audio output event
      const samples = Array.from(data.outputArray)
      audioSamples = audioSamples.concat(samples)
      console.log(`Received ${samples.length} audio samples`)
    } else {
      // This is a completion event with statistics
      console.log('TTS completed with stats:', data)
    }
  })
  .await()

// audioSamples now contains all PCM samples as 16-bit integers
// Sample rate: 24000 Hz, Format: mono PCM
console.log(`Total audio samples generated: ${audioSamples.length}`)
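If your playback path expects floating-point audio (e.g. a Web Audio-style API), convert the 16-bit samples to Float32 in the range [-1, 1), mirroring the reference-audio conversion in the quickstart (`int16ToFloat32` is an illustrative helper, not a library export):

```javascript
// Convert 16-bit PCM samples to Float32 in [-1, 1) for float-based audio APIs.
function int16ToFloat32 (samples) {
  const out = new Float32Array(samples.length)
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] / 32768
  return out
}
```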

More resources

Package on npm
