
@qvac/transcription-whispercpp

Automatic speech recognition (ASR) for speech-to-text.

Overview

Bare module that adds support for transcription in QVAC using whisper.cpp as the inference engine.

Models

You should load two models:

  • a whisper.cpp-compatible model for transcription (file format: *.bin); and
  • a VAD model (e.g., Silero) converted to GGML (file format: *.bin; optional, recommended).

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/transcription-whispercpp

Quickstart

If you don't have the Bare runtime, install it:

npm i -g bare

Create a new project:

mkdir qvac-transcription-quickstart
cd qvac-transcription-quickstart
npm init -y

Install dependencies:

npm i @qvac/dl-filesystem @qvac/transcription-whispercpp bare-fs bare-process

Download models and place them in models/:

  • A Whisper model (e.g., ggml-tiny.bin) from Hugging Face
  • (Optional) A Silero VAD model (ggml-silero-v5.1.2.bin)
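Both files can be fetched from the command line. The URLs below assume the ggerganov/whisper.cpp repository on Hugging Face, which hosts the GGML Whisper and Silero VAD files; verify the exact filenames before relying on them:

```shell
# Fetch a Whisper model and the Silero VAD model into ./models
# (URLs assume the ggerganov/whisper.cpp Hugging Face repository)
mkdir -p models
curl -L -o models/ggml-tiny.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
curl -L -o models/ggml-silero-v5.1.2.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-silero-v5.1.2.bin
```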

Create index.js:

index.js
'use strict'

const fs = require('bare-fs')
const process = require('bare-process')
const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')
const FilesystemDL = require('@qvac/dl-filesystem')

async function main () {
  const modelName = 'ggml-tiny.bin'
  const dirPath = './models'
  const audioFilePath = './my-audio.raw'

  // 1. Initializing data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Constructor arguments
  const constructorArgs = {
    modelName,
    loader: fsDL,
    diskPath: dirPath
  }

  // 3. Configuration object
  const config = {
    opts: { stats: true },
    whisperConfig: {
      audio_format: 's16le',
      vad_model_path: './models/ggml-silero-v5.1.2.bin',
      vad_params: {
        threshold: 0.35,
        min_speech_duration_ms: 200,
        min_silence_duration_ms: 150,
        max_speech_duration_s: 30,
        speech_pad_ms: 600,
        samples_overlap: 0.3
      },
      language: ''
    }
  }

  // 4. Loading model
  const model = new TranscriptionWhispercpp(constructorArgs, config)
  await model.load()

  // 5. Running transcription
  const bitRate = 128000
  const bytesPerSecond = bitRate / 8
  const audioStream = fs.createReadStream(audioFilePath, { highWaterMark: bytesPerSecond })

  const response = await model.run(audioStream)

  const full = []
  response.onUpdate((outputArr) => {
    const items = Array.isArray(outputArr) ? outputArr : [outputArr]
    const last = items[items.length - 1]
    if (last && last.text) console.log('[onUpdate]', last.start, '→', last.end, last.text)
  })

  for await (const output of response.iterate()) {
    const items = Array.isArray(output) ? output : [output]
    full.push(...items)
  }

  if (full.length) {
    const text = full.map(s => s.text).join(' ').trim()
    console.log('\n=== TRANSCRIPTION ===')
    console.log(text)
    console.log('=====================\n')
  } else {
    console.log('No transcription output received.')
  }

  // 6. Cleaning up resources
  await model.destroy()
  await fsDL.close()
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})

Run index.js:

bare index.js

Usage

1. Choose a Data Loader

First, select and instantiate a data loader that provides access to model files:

// Option A: Filesystem Data Loader - for local model files
const FilesystemDL = require('@qvac/dl-filesystem')
const fsDL = new FilesystemDL({
  dirPath: './path/to/model/files' // Directory containing model weights and settings
})

// Option B: Hyperdrive Data Loader - for peer-to-peer distributed models
const HyperDriveDL = require('@qvac/dl-hyperdrive')
// Key comes from the Model Registry (see below)
const hdDL = new HyperDriveDL({
  key: 'hd://<driveKey>',  // Hyperdrive key containing model files
  store: corestore        // (Optional) A Corestore instance; if not provided, the Hyperdrive uses an in-memory store
})

2. Configure Transcription Parameters

Most users interact with the addon exclusively through index.js. From that entrypoint we surface a small, safe subset of options; everything else keeps whisper.cpp defaults.

What index.js accepts

  • contextParams.model: absolute or relative path to the .bin whisper model. All other context keys keep their defaults, because changing them forces a full reload (see below).
  • whisperConfig (any whisper_full_params key): forwarded untouched. We surface convenience defaults in index.js, but every whisper.cpp flag is accepted.
  • miscConfig.caption_enabled: formats segments with <|start|>..<|end|> markers.

Context keys that force a full reload

Internally WhisperModel::configContextIsChanged() watches model, use_gpu, flash_attn and gpu_device. If any of these change we must:

  1. Call unload() (destroys the current whisper_context and whisper_state).
  2. Recreate the context via whisper_init_from_file_with_params.
  3. Warm up the model again before the next job.

Depending on model size this can take several seconds. Everything else in whisperConfig—language, temperatures, VAD settings, etc.—is applied in place and does not trigger a reload. If you are seeing unexpected pauses, double-check that you are not mutating these four context keys between jobs.
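That rule can be captured in a small guard run before reconfiguring. The key list mirrors configContextIsChanged(); the helper name is ours, not part of the addon API:

```javascript
// Keys watched by WhisperModel::configContextIsChanged(); changing any of
// them between jobs forces unload() + whisper_init_from_file_with_params.
const RELOAD_KEYS = ['model', 'use_gpu', 'flash_attn', 'gpu_device']

// Returns true when the next contextParams would trigger a full reload
function needsFullReload (prev, next) {
  return RELOAD_KEYS.some((k) => prev[k] !== next[k])
}

// Same model and GPU settings: applied in place, no reload
// needsFullReload({ model: 'a.bin' }, { model: 'a.bin' }) === false
// Different model path: several seconds of unload + re-init
// needsFullReload({ model: 'a.bin' }, { model: 'b.bin' }) === true
```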

Advanced configuration

Need more than the handful of options exposed in index.js? The upstream whisper.cpp documentation lists every flag available through whisper_full_params. Rather than duplicating that matrix here, refer to:

  • The official parameter reference: whisper_full_params
  • Our longer examples for concrete shapes:
    • examples/example.audio-ctx-chunking.js (shows offset_ms, duration_ms, audio_ctx, and reload loops)
    • examples/example.live-transcription.js (shows streaming chunks into a single job)

Those scripts stay in sync with the codebase and are the best place to copy from when you need the raw addon surface.

3. Configuration Example

Quick JS-level configuration (what you typically pass to new TranscriptionWhispercpp(...)):

const config = {
  contextParams: {
    model: './models/ggml-tiny.bin'
  },
  whisperConfig: {
    language: 'en',
    duration_ms: 0,
    temperature: 0.0,
    suppress_nst: true,
    n_threads: 0,
    vad_model_path: './models/ggml-silero-v5.1.2.bin',
    vadParams: {
      threshold: 0.6,
      min_speech_duration_ms: 250,
      min_silence_duration_ms: 200
    }
  },
  miscConfig: {
    caption_enabled: false
  }
}

Between this minimal configuration and the example scripts you should have everything needed, whether you are wiring the addon by hand or just instantiating TranscriptionWhispercpp.

Available Whisper Models:

  • ggml-tiny.bin - Smallest, fastest (39MB)
  • ggml-base.bin - Balanced size/accuracy (142MB)
  • ggml-small.bin - Better accuracy (466MB)
  • ggml-medium.bin - High accuracy (1.5GB)
  • ggml-large.bin - Best accuracy (3.1GB)

VAD Model:

  • ggml-silero-v5.1.2.bin - Silero VAD model for voice activity detection

Ensure model files are available in your chosen data loader source.

4. Create Model Instance

Import the specific Whisper model class based on the installed package and instantiate it:

const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')

const model = new TranscriptionWhispercpp(args, config)

Note: This import changes depending on the package installed.

5. Load Model

Load the model weights and initialize the inference engine. Optionally provide a callback for progress updates:

try {
  // Basic usage
  await model.load()

  // Advanced usage with progress tracking
  await model.load(
    false, // Don't close loader after loading
    (progress) => console.log(`Loading: ${progress.overallProgress}% complete`)
  )
} catch (error) {
  console.error('Failed to load model:', error)
}

Progress Callback Data

The progress callback receives an object with the following properties:

  • action (string): current operation being performed
  • totalSize (number): total bytes to be loaded
  • totalFiles (number): total number of files to process
  • filesProcessed (number): number of files completed so far
  • currentFile (string): name of the file currently being processed
  • currentFileProgress (string): percentage progress on the current file
  • overallProgress (string): overall loading progress percentage
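A minimal way to consume these fields is to flatten them into a status line. The property names follow the callback shape documented here; the formatter itself is our sketch, not part of the API:

```javascript
// Turn a progress-callback object into a one-line status string.
function formatProgress (p) {
  return `${p.action}: ${p.currentFile} (${p.currentFileProgress}%) ` +
    `${p.filesProcessed}/${p.totalFiles} files, ${p.overallProgress}% overall`
}

// await model.load(false, (p) => console.log(formatProgress(p)))
```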

6. Run Transcription

Pass an audio stream (e.g., from bare-fs.createReadStream) to the run method. Process the transcription results asynchronously.

There are two ways to receive transcription results:

Option 1: Real-time Streaming with onUpdate()

The onUpdate() callback receives each transcription segment in real-time as whisper.cpp generates them during processing. This is ideal for live transcription display or progressive updates.

try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000 // Adjust based on bitrate (e.g., 128000 / 8)
  })

  const response = await model.run(audioStream)

  // Receive segments as they are transcribed (real-time streaming)
  await response
    .onUpdate(segment => {
      console.log('New segment transcribed:', segment)
      // Each segment arrives immediately after whisper.cpp processes it
    })
    .await() // Wait for transcription to complete

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}

Option 2: Complete Result with iterate()

The iterate() method returns all transcription segments after the entire transcription completes. This is useful when you need the full result before processing.

try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000
  })

  const response = await model.run(audioStream)

  // Wait for complete transcription, then iterate over all segments
  for await (const transcriptionChunk of response.iterate()) {
    console.log('Transcription chunk:', transcriptionChunk)
  }

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}

Key Differences:

  • onUpdate(): Real-time streaming - segments arrive as they are generated by whisper.cpp's new_segment_callback
  • iterate(): Batch processing - all segments available after transcription completes

Chunking long recordings with reload()

examples/example.audio-ctx-chunking.js shows the production pattern: reuse a model instance, call reload() with { offset_ms, duration_ms, audio_ctx } per chunk (first chunk uses audio_ctx = 0, subsequent ones clamp to ~1500), then run the full audio stream. The matching integration test (test/integration/audio-ctx-chunking.test.js) exercises exactly the same flow.
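The chunk math behind that loop can be sketched as a pure helper. planChunks is ours; the { offset_ms, duration_ms, audio_ctx } argument names come from the example script, and the exact reload() call shape should be copied from there:

```javascript
// Plan the reload() arguments for each chunk of a long recording.
// First chunk runs with audio_ctx = 0 (full context); later chunks clamp to 1500.
function planChunks (totalMs, chunkMs = 30000) {
  const plans = []
  for (let offset = 0; offset < totalMs; offset += chunkMs) {
    plans.push({
      offset_ms: offset,
      duration_ms: Math.min(chunkMs, totalMs - offset),
      audio_ctx: offset === 0 ? 0 : 1500
    })
  }
  return plans
}

// for (const plan of planChunks(lengthMs)) {
//   await model.reload(plan) // see examples/example.audio-ctx-chunking.js
//   await model.run(audioStream)
// }
```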

Live streaming a single job

examples/example.live-transcription.js feeds tiny PCM buffers into a pushable Readable, keeps a single model.run(...) open, and relies on onUpdate() for incremental text. test/integration/live-stream-simulation.test.js covers both the streaming case and a segmented loop without any reload() calls.

7. Release Resources

Always unload the model when finished to free up memory and resources:

try {
  await model.unload()
  // If using Hyperdrive/Hyperbee, close the db instance if applicable
  await db.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}

Decoder + VAD + Whisper Integration Addon

This package combines audio decoding, optional VAD trimming, and Whisper transcription into a single TranscriptionFfmpegAddon. It automatically:

  1. Decodes or ingests raw PCM/encoded audio
  2. (Optionally) applies Silero VAD to drop non-speech
  3. Feeds speech segments to Whisper for transcription

The principles are the same as for the single Whisper addon, with some differences in the configuration interface.

Usage

Import TranscriptionFfmpegAddon from the transcription-ffmpeg.js module:

const TranscriptionFfmpegAddon = require('@qvac/transcription-whispercpp/transcription-ffmpeg')

Configuration

When you instantiate TranscriptionFfmpegAddon, pass:

  • loader: your data loader instance
  • params.decoder.audioFormat: one of
    • 'decoded' (raw PCM input - for pre-decoded audio files)
    • 'encoded' | 's16le' | 'f32le' | 'mp3' | 'wav' | 'm4a' (for encoded audio files)
  • params.decoder.streamIndex: stream index of the media file (default: 0)
  • params.decoder.inputBitrate: bitrate of the media file in bps (used to calculate buffer size)
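Putting those parameters together looks roughly like this. The parameter names follow the bullets above; the values are illustrative, and the loader is whichever data loader instance you chose earlier:

```javascript
// Decoder configuration for an encoded MP3 input
const inputBitrate = 128000 // bps; the addon uses this to size its buffers

const params = {
  decoder: {
    audioFormat: 'mp3', // 'decoded' for raw, pre-decoded PCM input
    streamIndex: 0,     // default: first stream in the media file
    inputBitrate
  }
}

// const addon = new TranscriptionFfmpegAddon({ loader: fsDL, params })
```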

Usage Example

See examples/example.ffmpeg.js for a full working script that demonstrates the FFmpeg decoder + Whisper transcription pipeline with encoded audio files (MP3, etc.).

Additional Features

  • Progress Tracking: Monitor loading progress with callbacks
  • Performance Stats: Measure inference time with the stats option

More resources

Package at npm
