
@qvac/llm-llamacpp

LLM inference for text generation and chat, with support for images and other media within a single conversation context.

Overview

A Bare module that adds support for text completion and multimodal prompts in QVAC, using llama.cpp as the inference engine.

Models

You can load any llama.cpp-compatible text-generation/chat model. Model file format: *.gguf.

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/llm-llamacpp

Quickstart

If you don't have the Bare runtime, install it:

npm i -g bare

Create a new project:

mkdir qvac-llm-quickstart
cd qvac-llm-quickstart
npm init -y

Install dependencies:

npm i @qvac/dl-filesystem @qvac/llm-llamacpp bare-process

Download a compatible model:

curl -L --create-dirs -o models/Llama-3.2-1B-Instruct-Q4_0.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf

Create index.js:

'use strict'

const LlmLlamacpp = require('@qvac/llm-llamacpp')
const FilesystemDL = require('@qvac/dl-filesystem')
const process = require('bare-process')

async function main () {
  const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
  const dirPath = './models'

  // 1. Initializing data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configuring model settings
  const args = {
    loader: fsDL,
    opts: { stats: true },
    logger: console,
    diskPath: dirPath,
    modelName
  }

  const config = {
    device: 'gpu',
    gpu_layers: '999',
    ctx_size: '1024'
  }

  // 3. Loading model
  const model = new LlmLlamacpp(args, config)
  await model.load()

  try {
    // 4. Running inference with conversation prompt
    const prompt = [
      {
        role: 'system',
        content: 'You are a helpful, respectful and honest assistant.'
      },
      {
        role: 'user',
        content: 'what is bitcoin?'
      },
      {
        role: 'assistant',
        content: "It's a digital currency."
      },
      {
        role: 'user',
        content: 'Can you elaborate on the previous topic?'
      }
    ]

    const response = await model.run(prompt)
    let fullResponse = ''

    await response
      .onUpdate(data => {
        process.stdout.write(data)
        fullResponse += data
      })
      .await()

    console.log('\n')
    console.log('Full response:\n', fullResponse)
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Cleaning up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(error => {
  console.error('Fatal error in main function:', {
    error: error.message,
    stack: error.stack,
    timestamp: new Date().toISOString()
  })
  process.exit(1)
})

Run index.js:

bare index.js

Usage

1. Import the Model Class

const LlmLlamacpp = require('@qvac/llm-llamacpp')

2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use the filesystem Data Loader (@qvac/dl-filesystem) to load model files from your local file system. Models can be downloaded directly from HuggingFace.

const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'

const fsDL = new FilesystemDL({ dirPath })

3. Create the args obj

const args = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName,
  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support you need to pass the projection model name
}

The args obj contains the following properties:

  • loader: The Data Loader instance from which the model file will be streamed.
  • logger: This property is used to create a QvacLogger instance, which handles all logging functionality.
  • opts.stats: This flag determines whether to calculate inference stats.
  • diskPath: The local directory where the model file will be downloaded to.
  • modelName: The name of the model file in the Data Loader.
  • projectionModel: The name of the projection model file in the Data Loader. This is required for multimodal support.
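For multimodal models the args shape is the same, with projectionModel added. A sketch, assuming the model file names from the commented example above (both files must exist in the Data Loader's directory):

```javascript
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const fsDL = new FilesystemDL({ dirPath })

// Illustrative multimodal args: the projection model pairs with the main model.
const multimodalArgs = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName: 'SmolVLM2-500M-Video-Instruct-Q8_0.gguf',
  projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf'
}
```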

4. Create the config obj

The config obj consists of a set of hyper-parameters which can be used to tweak the behaviour of the model.
All parameters must be strings.

// an example of possible configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to GPU
  ctx_size: '1024', // context length
  device: 'cpu' // required; must be 'gpu' or 'cpu', otherwise an error is thrown
}
| Parameter | Range / Type | Default | Description |
|---|---|---|---|
| device | "gpu" or "cpu" | (required) | Device to run inference on |
| gpu_layers | integer | 0 | Number of model layers to offload to GPU |
| ctx_size | 0 – model-dependent | 4096 (0 = loaded from model) | Context window size |
| lora | string | | Path to LoRA adapter file |
| temp | 0.00 – 2.00 | 0.8 | Sampling temperature |
| top_p | 0 – 1 | 0.9 | Top-p (nucleus) sampling |
| top_k | 0 – 128 | 40 | Top-k sampling |
| predict | integer (-1 = infinity) | -1 | Maximum tokens to predict |
| seed | integer | -1 (random) | Random seed for sampling |
| no_mmap | "" (passing an empty string sets the flag) | | Disable memory mapping for model loading |
| reverse_prompt | string (comma-separated) | | Stop generation when these strings are encountered |
| repeat_penalty | float | 1.1 | Repetition penalty |
| presence_penalty | float | 0 | Presence penalty for sampling |
| frequency_penalty | float | 0 | Frequency penalty for sampling |
| tools | "true" or "false" | "false" | Enable tool calling with jinja templating |
| verbosity | 0 – 3 (0 = ERROR, 1 = WARNING, 2 = INFO, 3 = DEBUG) | 0 | Logging verbosity level |
| n_discarded | integer | 0 | Tokens to discard in sliding window context |
| main-gpu | integer, "integrated", or "dedicated" | | GPU selection for multi-GPU systems |
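To make the sampling parameters concrete, here is an illustrative config (the values are chosen for demonstration only; every value is passed as a string, per the rule above):

```javascript
// Illustrative config exercising several sampling parameters.
const samplingConfig = {
  device: 'cpu',
  ctx_size: '2048',
  temp: '0.2',                // lower temperature for more deterministic output
  top_p: '0.9',               // nucleus sampling threshold
  top_k: '40',                // sample only from the 40 most likely tokens
  predict: '256',             // stop after at most 256 generated tokens
  seed: '42',                 // fixed seed for reproducible sampling
  reverse_prompt: 'User:,###' // stop when either string is generated
}

// Every value is a string, as required:
console.log(Object.values(samplingConfig).every(v => typeof v === 'string'))
```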

IGPU/GPU selection logic:

Scenariomain-gpu not specifiedmain-gpu: "dedicated"main-gpu: "integrated"
Devices consideredAll GPUs (dedicated + integrated)Only dedicated GPUsOnly integrated GPUs
System with iGPU only✅ Uses iGPU❌ Falls back to CPU✅ Uses iGPU
System with dedicated GPU only✅ Uses dedicated GPU✅ Uses dedicated GPU❌ Falls back to CPU
System with both✅ Uses dedicated GPU (preferred)✅ Uses dedicated GPU✅ Uses integrated GPU
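The selection rules above can be expressed as configs. A sketch, assuming (per the parameter table) that a numeric GPU index is passed as a string like every other parameter:

```javascript
// Prefer a dedicated GPU; falls back to CPU if none exists.
const dedicatedConfig = {
  device: 'gpu',
  gpu_layers: '99',
  ctx_size: '2048',
  'main-gpu': 'dedicated'
}

// Force the integrated GPU, e.g. to keep the dedicated GPU free for other work.
const integratedConfig = { ...dedicatedConfig, 'main-gpu': 'integrated' }

console.log(dedicatedConfig['main-gpu'], integratedConfig['main-gpu'])
```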

5. Create Model Instance

const model = new LlmLlamacpp(args, config)

6. Load Model

await model.load()

Optionally, you can pass the following parameters to tweak the loading behaviour:

  • close?: A boolean that determines whether to close the Data Loader after loading. Defaults to true.
  • reportProgressCallback?: A callback function that is called periodically with progress updates. It can be used to display the overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

| Property | Type | Description |
|---|---|---|
| action | string | Current operation being performed |
| totalSize | number | Total bytes to be loaded |
| totalFiles | number | Total number of files to process |
| filesProcessed | number | Number of files completed so far |
| currentFile | string | Name of the file currently being processed |
| currentFileProgress | string | Percentage progress on the current file |
| overallProgress | string | Overall loading progress percentage |
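A callback can combine these fields into a single status line. A sketch (formatProgress and the sample payload are illustrative, not part of the API):

```javascript
// Format the documented progress fields into one human-readable line.
function formatProgress (p) {
  return `${p.action}: file ${p.filesProcessed + 1}/${p.totalFiles} ` +
    `(${p.currentFile} at ${p.currentFileProgress}%), overall ${p.overallProgress}%`
}

// Example payload shaped like the table above (values are illustrative):
const line = formatProgress({
  action: 'downloading',
  totalSize: 770000000,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'Llama-3.2-1B-Instruct-Q4_0.gguf',
  currentFileProgress: '42.0',
  overallProgress: '42.0'
})
console.log(line)
```

You could then pass it to load, e.g. `model.load(false, p => process.stdout.write('\r' + formatProgress(p)))`.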

7. Run Inference

Pass an array of messages (following the chat completion format) to the run method. Process the generated tokens asynchronously:

try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]

  const response = await model.run(messages)
  const buffer = []

  // Option 1: Process streamed output using async iterator
  for await (const token of response.iterate()) {
    process.stdout.write(token) // Write token directly to output
    buffer.push(token)
  }

  // Option 2: Process streamed output using callback
  await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))

} catch (error) {
  console.error('Inference failed:', error)
}

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}

More resources

Package at npm
