
Transcription

Automatic speech recognition (ASR), i.e., speech-to-text: generate text transcriptions from audio input.

Overview

Transcription uses whisper.cpp as its inference engine. Load any supported model with modelType: "whisper", then provide audio input via audioChunk, either as a file path (string) or as an in-memory audio buffer.
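When passing an in-memory buffer, it must match the audio_format declared at load time. The sketch below, assuming the "f32le" (raw little-endian 32-bit float PCM) format used in the example later on this page, builds such a buffer from synthetic samples; real input would come from a decoder or microphone capture.

```javascript
// Illustrative sketch: preparing an in-memory audio buffer for a model
// loaded with audio_format: "f32le". The sine wave is a stand-in for
// real decoded audio samples.
function makeF32leBuffer(sampleRateHz, durationS, freqHz) {
    const n = Math.floor(sampleRateHz * durationS);
    const samples = new Float32Array(n);
    for (let i = 0; i < n; i++) {
        samples[i] = Math.sin((2 * Math.PI * freqHz * i) / sampleRateHz);
    }
    // Raw little-endian f32 bytes, as expected for "f32le" input
    return Buffer.from(samples.buffer);
}

const audioChunk = makeF32leBuffer(16000, 1.0, 440);
// audioChunk can now be passed as the audioChunk argument in place of a file path
```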

Optionally (but recommended), also pass an ancillary voice activity detection (VAD) model (e.g., Silero) to improve speech segmentation.

transcribe() returns the full transcription as a single string. If you need partial results as they become available, use transcribeStream() to receive text chunks in real-time.
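The streaming consumption pattern can be sketched as follows. transcribeStream() is assumed here to yield text chunks as an async iterable; the mock generator stands in for it so the sketch is self-contained (check SDK — API reference for the actual return shape).

```javascript
// Mock stand-in for transcribeStream(): yields partial text chunks in order
async function* mockTranscribeStream() {
    yield "Partial ";
    yield "results ";
    yield "arrive in order.";
}

// Accumulate chunks as they arrive (e.g., to drive a live caption UI)
async function collectTranscript(stream) {
    let full = "";
    for await (const chunk of stream) {
        full += chunk; // update UI with `full` here for real-time display
    }
    return full;
}

collectTranscript(mockTranscribeStream()).then((text) => {
    console.log(text); // "Partial results arrive in order."
});
```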

Functions

Use the following sequence of function calls:

  1. loadModel()
  2. transcribe() or transcribeStream()
  3. unloadModel()

For how to use each function, see SDK — API reference.

Models

You should load up to two models:

  • a whisper.cpp-compatible model for transcription (model file format: *.bin); and
  • optionally (but recommended), a VAD model (e.g., Silero) converted to GGML (model file format: *.bin).

For models available as constants, see SDK — Models.
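The shape of a loadModel() options object that references both model files can be sketched as below. The vadModelSrc key is a hypothetical name (not confirmed by this guide) and the model paths are illustrative; consult SDK — API reference for the actual option names.

```javascript
// Hypothetical sketch of loadModel() options combining a transcription
// model with a VAD model. `vadModelSrc` is an assumed key name.
const loadOptions = {
    modelSrc: "models/ggml-tiny.bin",     // whisper.cpp-compatible *.bin
    modelType: "whisper",
    vadModelSrc: "models/silero-vad.bin", // optional, recommended (assumed key)
    modelConfig: {
        audio_format: "f32le",
        language: "en",
    },
};
```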

Example

The following script shows an example of transcription:

transcription.js
import { loadModel, unloadModel, transcribe, WHISPER_TINY } from "@qvac/sdk";
try {
    console.log("🎤 Starting Whisper transcription with prompt example...");
    // Load the Whisper model
    console.log("📥 Loading Whisper model...");
    const modelId = await loadModel({
        modelSrc: WHISPER_TINY,
        modelType: "whisper",
        modelConfig: {
            audio_format: "f32le",
            // Sampling strategy
            strategy: "greedy",
            n_threads: 4,
            // Transcription options
            language: "en",
            translate: false,
            no_timestamps: false,
            single_segment: false,
            print_timestamps: true,
            token_timestamps: true,
            // Quality settings
            temperature: 0.0,
            suppress_blank: true,
            suppress_nst: true,
            // Advanced tuning
            entropy_thold: 2.4,
            logprob_thold: -1.0,
            // VAD configuration
            vad_params: {
                threshold: 0.35,
                min_speech_duration_ms: 200,
                min_silence_duration_ms: 150,
                max_speech_duration_s: 30.0,
                speech_pad_ms: 600,
                samples_overlap: 0.3,
            },
            // Context parameters for GPU
            contextParams: {
                use_gpu: true,
                flash_attn: true,
                gpu_device: 0,
            },
        },
        onProgress: (progress) => {
            console.log(progress);
        },
    });
    console.log(`✅ Whisper model loaded with ID: ${modelId}`);
    // Perform transcription
    console.log("🎧 Transcribing audio...");
    const text = await transcribe({
        modelId,
        audioChunk: "examples/audio/sample-16khz.wav",
        prompt: "This is a test recording with clear speech and proper punctuation.",
    });
    console.log("📝 Transcription result:");
    console.log(text);
    // Unload the model when done
    console.log("🧹 Unloading Whisper model...");
    await unloadModel({ modelId });
    console.log("✅ Whisper model unloaded successfully");
    process.exit(0);
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
