# @qvac/transcription-whispercpp

Automatic speech recognition (ASR) for speech-to-text.

## Overview
Bare module that adds support for transcription in QVAC using whisper.cpp as the inference engine.
## Models

You should load two models:

- a whisper.cpp-compatible model for transcription. Model file format: `*.bin`; and
- a VAD model (e.g., Silero) converted to GGML. Model file format: `*.bin` (optional, recommended).
## Requirement

Bare v1.24

## Installation

```sh
npm i @qvac/transcription-whispercpp
```

## Quickstart
If you don't have the Bare runtime, install it:

```sh
npm i -g bare
```

Create a new project:
```sh
mkdir qvac-transcription-quickstart
cd qvac-transcription-quickstart
npm init -y
```

Install dependencies:

```sh
npm i @qvac/dl-filesystem @qvac/transcription-whispercpp bare-fs bare-process
```

Download models and place them in `models/`:
- A Whisper model (e.g., `ggml-tiny.bin`) from Hugging Face
- (Optional) A Silero VAD model (`ggml-silero-v5.1.2.bin`)
Create `index.js`:

```js
'use strict'

const fs = require('bare-fs')
const process = require('bare-process')
const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')
const FilesystemDL = require('@qvac/dl-filesystem')

async function main () {
  const modelName = 'ggml-tiny.bin'
  const dirPath = './models'
  const audioFilePath = './my-audio.raw'

  // 1. Initialize the data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Constructor arguments
  const constructorArgs = {
    modelName,
    loader: fsDL,
    diskPath: dirPath
  }

  // 3. Configuration object
  const config = {
    opts: { stats: true },
    whisperConfig: {
      audio_format: 's16le',
      vad_model_path: './models/ggml-silero-v5.1.2.bin',
      vad_params: {
        threshold: 0.35,
        min_speech_duration_ms: 200,
        min_silence_duration_ms: 150,
        max_speech_duration_s: 30,
        speech_pad_ms: 600,
        samples_overlap: 0.3
      },
      language: ''
    }
  }

  // 4. Load the model
  const model = new TranscriptionWhispercpp(constructorArgs, config)
  await model.load()

  // 5. Run transcription
  const bitRate = 128000
  const bytesPerSecond = bitRate / 8
  const audioStream = fs.createReadStream(audioFilePath, { highWaterMark: bytesPerSecond })

  const response = await model.run(audioStream)

  const full = []
  response.onUpdate((outputArr) => {
    const items = Array.isArray(outputArr) ? outputArr : [outputArr]
    const last = items[items.length - 1]
    if (last && last.text) console.log('[onUpdate]', last.start, '→', last.end, last.text)
  })

  for await (const output of response.iterate()) {
    const items = Array.isArray(output) ? output : [output]
    full.push(...items)
  }

  if (full.length) {
    const text = full.map(s => s.text).join(' ').trim()
    console.log('\n=== TRANSCRIPTION ===')
    console.log(text)
    console.log('=====================\n')
  } else {
    console.log('No transcription output received.')
  }

  // 6. Clean up resources
  await model.destroy()
  await fsDL.close()
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})
```

Run `index.js`:

```sh
bare index.js
```

## Usage
### 1. Choose a Data Loader

First, select and instantiate a data loader that provides access to model files:
```js
// Option A: Filesystem Data Loader - for local model files
const FilesystemDL = require('@qvac/dl-filesystem')

const fsDL = new FilesystemDL({
  dirPath: './path/to/model/files' // Directory containing model weights and settings
})

// Option B: Hyperdrive Data Loader - for peer-to-peer distributed models
const HyperDriveDL = require('@qvac/dl-hyperdrive')

// Key comes from the Model Registry (see below)
const hdDL = new HyperDriveDL({
  key: 'hd://<driveKey>', // Hyperdrive key containing model files
  store: corestore // (Optional) A Corestore instance. If not provided, the Hyperdrive uses an in-memory store.
})
```

### 2. Configure Transcription Parameters
Most users interact with the addon exclusively through `index.js`. That entrypoint surfaces a small, safe subset of options; everything else keeps the whisper.cpp defaults.

#### What `index.js` accepts
| Section | Key | Description |
|---|---|---|
| `contextParams` | `model` | Absolute or relative path to the `.bin` whisper model |
| | *(all other context keys)* | Keep their defaults, because changing them forces a full reload (see below) |
| `whisperConfig` | *(any `whisper_full_params` key)* | Forwarded untouched. We surface convenience defaults in `index.js`, but every whisper.cpp flag is accepted |
| `miscConfig` | `caption_enabled` | Formats segments with `<\|start\|>`..`<\|end\|>` markers |
#### Context keys that force a full reload

Internally, `WhisperModel::configContextIsChanged()` watches `model`, `use_gpu`, `flash_attn`, and `gpu_device`. If any of these change, we must:

1. Call `unload()` (destroys the current `whisper_context` and `whisper_state`).
2. Recreate the context via `whisper_init_from_file_with_params`.
3. Warm up the model again before the next job.

Depending on model size this can take several seconds. Everything else in `whisperConfig`—language, temperatures, VAD settings, etc.—is applied in place and does not trigger a reload. If you are seeing unexpected pauses, double-check that you are not mutating these four context keys between jobs.
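As a mental model, the reload check boils down to a comparison over those four keys. Here is a hypothetical JS re-implementation (a sketch of the logic described above, not the addon's actual C++ code):

```javascript
// Hypothetical sketch: only these four context keys force a full
// unload/re-init cycle when they change between jobs.
const RELOAD_KEYS = ['model', 'use_gpu', 'flash_attn', 'gpu_device']

function contextNeedsReload (prev, next) {
  return RELOAD_KEYS.some(k => prev[k] !== next[k])
}

// Changing only whisperConfig-style options does not trip the check.
console.log(contextNeedsReload({ model: 'a.bin' }, { model: 'a.bin' })) // false
console.log(contextNeedsReload({ model: 'a.bin' }, { model: 'b.bin' })) // true
```

If a job feels unexpectedly slow, running your previous and next context params through a check like this is a quick way to spot an accidental reload.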
#### Advanced configuration

Need more than the handful of options exposed in `index.js`? The upstream whisper.cpp documentation lists every flag available through `whisper_full_params`. Rather than duplicating that matrix here, refer to:

- The official parameter reference: `whisper_full_params`
- Our longer examples for concrete shapes:
  - `examples/example.audio-ctx-chunking.js` (shows `offset_ms`, `duration_ms`, `audio_ctx`, and reload loops)
  - `examples/example.live-transcription.js` (shows streaming chunks into a single job)
Those scripts stay in sync with the codebase and are the best place to copy from when you need the raw addon surface.
### 3. Configuration Example

Quick JS-level configuration (what you typically pass to `new TranscriptionWhispercpp(...)`):
```js
const config = {
  contextParams: {
    model: './models/ggml-tiny.bin'
  },
  whisperConfig: {
    language: 'en',
    duration_ms: 0,
    temperature: 0.0,
    suppress_nst: true,
    n_threads: 0,
    vad_model_path: './models/ggml-silero-v5.1.2.bin',
    vadParams: {
      threshold: 0.6,
      min_speech_duration_ms: 250,
      min_silence_duration_ms: 200
    }
  },
  miscConfig: {
    caption_enabled: false
  }
}
```

Between this minimal configuration and the example scripts you should have everything you need, whether you are wiring the addon by hand or just instantiating `TranscriptionWhispercpp`.
**Available Whisper Models:**

- `ggml-tiny.bin` - Smallest, fastest (39MB)
- `ggml-base.bin` - Balanced size/accuracy (142MB)
- `ggml-small.bin` - Better accuracy (466MB)
- `ggml-medium.bin` - High accuracy (1.5GB)
- `ggml-large.bin` - Best accuracy (3.1GB)

**VAD Model:**

- `ggml-silero-v5.1.2.bin` - Silero VAD model for voice activity detection
Ensure model files are available in your chosen data loader source.
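If you choose a model at runtime, a small helper can pick the largest one that fits a size budget, using the approximate sizes listed above (the helper and the MB figures are illustrative, not part of the package API):

```javascript
// Approximate model sizes from the list above (illustrative, in MB).
const MODEL_SIZES_MB = [
  ['ggml-tiny.bin', 39],
  ['ggml-base.bin', 142],
  ['ggml-small.bin', 466],
  ['ggml-medium.bin', 1500],
  ['ggml-large.bin', 3100]
]

// Pick the largest model that fits the budget (falls back to tiny).
function pickModel (budgetMb) {
  let choice = MODEL_SIZES_MB[0][0]
  for (const [name, sizeMb] of MODEL_SIZES_MB) {
    if (sizeMb <= budgetMb) choice = name
  }
  return choice
}

console.log(pickModel(500)) // ggml-small.bin
```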
### 4. Create Model Instance

Import the specific Whisper model class based on the installed package and instantiate it:
```js
const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')

const model = new TranscriptionWhispercpp(args, config)
```

Note: This import changes depending on which package is installed.
### 5. Load Model

Load the model weights and initialize the inference engine. Optionally provide a callback for progress updates:
```js
try {
  // Basic usage
  await model.load()

  // Advanced usage with progress tracking
  await model.load(
    false, // Don't close loader after loading
    (progress) => console.log(`Loading: ${progress.overallProgress}% complete`)
  )
} catch (error) {
  console.error('Failed to load model:', error)
}
```

#### Progress Callback Data
The progress callback receives an object with the following properties:
| Property | Type | Description |
|---|---|---|
| `action` | string | Current operation being performed |
| `totalSize` | number | Total bytes to be loaded |
| `totalFiles` | number | Total number of files to process |
| `filesProcessed` | number | Number of files completed so far |
| `currentFile` | string | Name of the file currently being processed |
| `currentFileProgress` | string | Percentage progress on the current file |
| `overallProgress` | string | Overall loading progress percentage |
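For example, a progress callback that renders those fields into a single log line might look like this (a sketch; the field values shown are made up):

```javascript
// Render the progress object's fields (see table above) as one log line.
function formatProgress (p) {
  return `${p.action}: file ${p.filesProcessed + 1}/${p.totalFiles} ` +
    `(${p.currentFile} ${p.currentFileProgress}), overall ${p.overallProgress}`
}

console.log(formatProgress({
  action: 'loading',
  totalSize: 40894795,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'ggml-tiny.bin',
  currentFileProgress: '42%',
  overallProgress: '42%'
}))
// → loading: file 1/1 (ggml-tiny.bin 42%), overall 42%
```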
### 6. Run Transcription

Pass an audio stream (e.g., from `bare-fs`'s `createReadStream`) to the `run` method and process the transcription results asynchronously.

There are two ways to receive transcription results:
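A common convention (used in the quickstart above) is to size the read stream's `highWaterMark` to roughly one second of audio, derived from the bitrate. A small sketch of that arithmetic:

```javascript
// Derive a read-stream chunk size (bytes) from an audio bitrate in bits/s.
// 128 kbps → 16000 bytes ≈ one second of audio per chunk.
function chunkSizeForBitrate (bitRateBps, seconds = 1) {
  return Math.floor((bitRateBps / 8) * seconds)
}

console.log(chunkSizeForBitrate(128000)) // 16000
```

Pass the result as `{ highWaterMark: chunkSizeForBitrate(bitRate) }` when creating the read stream.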
#### Option 1: Real-time Streaming with `onUpdate()`

The `onUpdate()` callback receives each transcription segment in real time as whisper.cpp generates it during processing. This is ideal for live transcription display or progressive updates.
```js
try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000 // Adjust based on bitrate (e.g., 128000 / 8)
  })

  const response = await model.run(audioStream)

  // Receive segments as they are transcribed (real-time streaming)
  await response
    .onUpdate(segment => {
      console.log('New segment transcribed:', segment)
      // Each segment arrives immediately after whisper.cpp processes it
    })
    .await() // Wait for transcription to complete

  console.log('Transcription finished!')
} catch (error) {
  console.error('Transcription failed:', error)
}
```

#### Option 2: Complete Result with `iterate()`
The `iterate()` method returns all transcription segments after the entire transcription completes. This is useful when you need the full result before processing.
```js
try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000
  })

  const response = await model.run(audioStream)

  // Wait for complete transcription, then iterate over all segments
  for await (const transcriptionChunk of response.iterate()) {
    console.log('Transcription chunk:', transcriptionChunk)
  }

  console.log('Transcription finished!')
} catch (error) {
  console.error('Transcription failed:', error)
}
```

**Key Differences:**
- `onUpdate()`: Real-time streaming - segments arrive as they are generated by whisper.cpp's `new_segment_callback`
- `iterate()`: Batch processing - all segments available after transcription completes
#### Chunking long recordings with `reload()`

`examples/example.audio-ctx-chunking.js` shows the production pattern: reuse a model instance, call `reload()` with `{ offset_ms, duration_ms, audio_ctx }` per chunk (the first chunk uses `audio_ctx = 0`; subsequent ones clamp to ~1500), then run the full audio stream. The matching integration test (`test/integration/audio-ctx-chunking.test.js`) exercises exactly the same flow.
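The per-chunk parameters follow a simple schedule. A hypothetical helper that generates them (the chunking math is ours — copy the real loop from `examples/example.audio-ctx-chunking.js`):

```javascript
// Build the reload() argument for each chunk of a long recording:
// the first chunk keeps audio_ctx = 0 (full context), later ones clamp to 1500.
function chunkSchedule (totalMs, chunkMs) {
  const chunks = []
  for (let offset = 0; offset < totalMs; offset += chunkMs) {
    chunks.push({
      offset_ms: offset,
      duration_ms: Math.min(chunkMs, totalMs - offset),
      audio_ctx: offset === 0 ? 0 : 1500
    })
  }
  return chunks
}

// A 70 s recording in 30 s chunks yields three reload() calls.
console.log(chunkSchedule(70000, 30000))
```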
#### Live streaming a single job

`examples/example.live-transcription.js` feeds tiny PCM buffers into a pushable `Readable`, keeps a single `model.run(...)` open, and relies on `onUpdate()` for incremental text. `test/integration/live-stream-simulation.test.js` covers both the streaming case and a segmented loop without any `reload()` calls.
### 7. Release Resources

Always unload the model when finished to free up memory and resources:
```js
try {
  await model.unload()

  // If using Hyperdrive/Hyperbee, close the db instance if applicable
  await db.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```

## Decoder + VAD + Whisper Integration AddOn
This package combines audio decoding, optional VAD trimming, and Whisper transcription into a single `TranscriptionFfmpegAddon`. It automatically:
- Decodes or ingests raw PCM/encoded audio
- (Optionally) applies Silero VAD to drop non-speech
- Feeds speech segments to Whisper for transcription
The principles are the same as for the single Whisper addon, but with some differences in the configuration interface.
### Usage

Import `TranscriptionFfmpegAddon` from the `transcription-ffmpeg.js` module:

```js
const TranscriptionFfmpegAddon = require('@qvac/transcription-whispercpp/transcription-ffmpeg')
```

### Configuration
When you instantiate `TranscriptionFfmpegAddon`, pass:

- `loader`: your data loader instance
- `params.decoder.audioFormat`: one of
  - `'decoded'` (raw PCM input - for pre-decoded audio files)
  - `'encoded'` | `'s16le'` | `'f32le'` | `'mp3'` | `'wav'` | `'m4a'` (for encoded audio files)
- `params.decoder.streamIndex`: stream index of the media file (default: 0)
- `params.decoder.inputBitrate`: bitrate of the media file in bps (used to calculate buffer size)
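Putting those keys together, the options object might look like the following sketch (the loader value and bitrate are placeholders; only the key names come from the list above):

```javascript
// Hypothetical shape of the TranscriptionFfmpegAddon options.
const addonOptions = {
  loader: null, // your data loader instance (e.g., FilesystemDL or HyperDriveDL)
  params: {
    decoder: {
      audioFormat: 'mp3',   // 'decoded' | 'encoded' | 's16le' | 'f32le' | 'mp3' | 'wav' | 'm4a'
      streamIndex: 0,       // stream index of the media file (default: 0)
      inputBitrate: 128000  // bps, used to size the internal read buffer
    }
  }
}

console.log(addonOptions.params.decoder.audioFormat) // mp3
```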
### Usage Example

See `examples/example.ffmpeg.js` for a full working script that demonstrates the FFmpeg decoder + Whisper transcription pipeline with encoded audio files (MP3, etc.).
## Additional Features

- **Progress Tracking**: Monitor loading progress with callbacks
- **Performance Stats**: Measure inference time with the `stats` option