# @qvac/llm-llamacpp

LLM inference for text generation and chat, with support for images and other media within a single conversation context.

## Overview

Bare module that adds support for text completion and multimodal prompts in QVAC, using llama.cpp as the inference engine.
## Models

You can load any llama.cpp-compatible text-generation/chat model. Model file format: `*.gguf`.
## Requirements

Bare v1.24
## Installation

```sh
npm i @qvac/llm-llamacpp
```

## Quickstart
If you don't have the Bare runtime, install it:

```sh
npm i -g bare
```

Create a new project:
```sh
mkdir qvac-llm-quickstart
cd qvac-llm-quickstart
npm init -y
```

Install dependencies:

```sh
npm i @qvac/dl-filesystem @qvac/llm-llamacpp bare-process
```

Download a compatible model:
```sh
curl -L --create-dirs -o models/Llama-3.2-1B-Instruct-Q4_0.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
```

Create `index.js`:
```js
'use strict'

const LlmLlamacpp = require('@qvac/llm-llamacpp')
const FilesystemDL = require('@qvac/dl-filesystem')
const process = require('bare-process')

async function main () {
  const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
  const dirPath = './models'

  // 1. Initialize the data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configure model settings
  const args = {
    loader: fsDL,
    opts: { stats: true },
    logger: console,
    diskPath: dirPath,
    modelName
  }
  const config = {
    device: 'gpu',
    gpu_layers: '999',
    ctx_size: '1024'
  }

  // 3. Load the model
  const model = new LlmLlamacpp(args, config)
  await model.load()

  try {
    // 4. Run inference with a conversation prompt
    const prompt = [
      {
        role: 'system',
        content: 'You are a helpful, respectful and honest assistant.'
      },
      {
        role: 'user',
        content: 'What is bitcoin?'
      },
      {
        role: 'assistant',
        content: "It's a digital currency."
      },
      {
        role: 'user',
        content: 'Can you elaborate on the previous topic?'
      }
    ]
    const response = await model.run(prompt)

    let fullResponse = ''
    await response
      .onUpdate(data => {
        process.stdout.write(data)
        fullResponse += data
      })
      .await()

    console.log('\n')
    console.log('Full response:\n', fullResponse)
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Clean up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(error => {
  console.error('Fatal error in main function:', {
    error: error.message,
    stack: error.stack,
    timestamp: new Date().toISOString()
  })
  process.exit(1)
})
```

Run `index.js`:

```sh
bare index.js
```

## Usage
### 1. Import the Model Class

```js
const LlmLlamacpp = require('@qvac/llm-llamacpp')
```

### 2. Create a Data Loader
Data Loaders abstract the way model files are accessed. Use a FileSystemDataLoader to load model files from your local file system. Models can be downloaded directly from HuggingFace.
```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the `args` Object
```js
const args = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName,
  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support, pass the projection model name
}
```

The `args` object contains the following properties:
- `loader`: The Data Loader instance from which the model file will be streamed.
- `logger`: Used to create a `QvacLogger` instance, which handles all logging functionality.
- `opts.stats`: Flag that determines whether to calculate inference stats.
- `diskPath`: The local directory the model file will be downloaded to.
- `modelName`: The name of the model file in the Data Loader.
- `projectionModel`: The name of the projection model file in the Data Loader. This is required for multimodal support.
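For a multimodal model, the `args` object differs only in the extra `projectionModel` entry. A minimal sketch — the loader is stubbed out here for illustration (in real code it is the `FilesystemDL` instance from step 2), and the model file name is illustrative; only the projection model name comes from the commented example above:

```js
// Sketch only: fsDL stands in for the FilesystemDL instance from step 2.
const fsDL = null // placeholder for: new FilesystemDL({ dirPath: './models' })

const multimodalArgs = {
  loader: fsDL,
  opts: { stats: true }, // collect inference stats
  logger: console,
  diskPath: './models',
  modelName: 'SmolVLM2-500M-Video-Instruct-Q8_0.gguf', // illustrative file name
  projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // required for multimodal
}

console.log('keys:', Object.keys(multimodalArgs).join(','))
```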
### 4. Create the `config` Object

The `config` object consists of a set of hyper-parameters that can be used to tweak the behaviour of the model.
All parameter values must be strings.
```js
// an example of a possible configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to the GPU
  ctx_size: '1024', // context length
  device: 'cpu' // required: 'gpu' or 'cpu'; omitting it throws an error
}
```

| Parameter | Range / Type | Default | Description |
|---|---|---|---|
| device | "gpu" or "cpu" | — (required) | Device to run inference on |
| gpu_layers | integer | 0 | Number of model layers to offload to GPU |
| ctx_size | 0 – model-dependent | 4096 (0 = loaded from model) | Context window size |
| lora | string | — | Path to LoRA adapter file |
| temp | 0.00 – 2.00 | 0.8 | Sampling temperature |
| top_p | 0 – 1 | 0.9 | Top-p (nucleus) sampling |
| top_k | 0 – 128 | 40 | Top-k sampling |
| predict | integer (-1 = infinity) | -1 | Maximum tokens to predict |
| seed | integer | -1 (random) | Random seed for sampling |
| no_mmap | "" (passing empty string sets the flag) | — | Disable memory mapping for model loading |
| reverse_prompt | string (comma-separated) | — | Stop generation when these strings are encountered |
| repeat_penalty | float | 1.1 | Repetition penalty |
| presence_penalty | float | 0 | Presence penalty for sampling |
| frequency_penalty | float | 0 | Frequency penalty for sampling |
| tools | "true" or "false" | "false" | Enable tool calling with jinja templating |
| verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
| n_discarded | integer | 0 | Tokens to discard in sliding window context |
| main-gpu | integer, "integrated", or "dedicated" | — | GPU selection for multi-GPU systems |
iGPU/GPU selection logic:
| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
|---|---|---|---|
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |
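Putting the two tables together, a config that pins inference to a dedicated GPU might look like the sketch below. Every value is a string, as required above; the specific numbers are illustrative, not recommendations:

```js
// All config values must be strings.
const config = {
  device: 'gpu',           // required: 'gpu' or 'cpu'
  'main-gpu': 'dedicated', // consider only dedicated GPUs (falls back to CPU if none)
  gpu_layers: '99',        // offload up to 99 layers to the GPU
  ctx_size: '2048',        // context window size
  temp: '0.7',             // sampling temperature
  seed: '42'               // fixed seed for reproducible sampling
}

// Sanity check: the format requires every value to be a string.
const allStrings = Object.values(config).every(v => typeof v === 'string')
console.log('all values are strings:', allStrings)
```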
### 5. Create a Model Instance

```js
const model = new LlmLlamacpp(args, config)
```

### 6. Load the Model

```js
await model.load()
```

Optionally, you can pass the following parameters to tweak the loading behaviour:
- `close?`: Boolean value that determines whether to close the Data Loader after loading. Defaults to `true`.
- `reportProgressCallback?`: A callback function that gets called periodically with progress updates. It can be used to display the overall progress percentage.
For example:

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

#### Progress Callback Data
The progress callback receives an object with the following properties:
| Property | Type | Description |
|---|---|---|
| `action` | string | Current operation being performed |
| `totalSize` | number | Total bytes to be loaded |
| `totalFiles` | number | Total number of files to process |
| `filesProcessed` | number | Number of files completed so far |
| `currentFile` | string | Name of the file currently being processed |
| `currentFileProgress` | string | Percentage progress on the current file |
| `overallProgress` | string | Overall loading progress percentage |
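The callback itself is just a function of that object. A small sketch of a reporter — the sample values below are made up purely to show the shapes (strings for percentages, numbers for sizes and counts, per the table above):

```js
// Format one progress update into a single status line.
function formatProgress (p) {
  return `${p.action}: file ${p.filesProcessed + 1}/${p.totalFiles} ` +
    `(${p.currentFile} ${p.currentFileProgress}%), overall ${p.overallProgress}%`
}

// Illustrative sample update, shaped like the table above.
const sample = {
  action: 'downloading',
  totalSize: 773000000,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'Llama-3.2-1B-Instruct-Q4_0.gguf',
  currentFileProgress: '42',
  overallProgress: '42'
}

console.log(formatProgress(sample))
```

In real code you would pass something like `progress => process.stdout.write('\r' + formatProgress(progress))` as the second argument to `model.load`.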
### 7. Run Inference

Pass an array of messages (following the chat-completion format) to the `run` method, then process the generated tokens asynchronously:
```js
try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]
  const response = await model.run(messages)
  const buffer = []

  // Option 1: process streamed output using an async iterator
  for await (const token of response.iterate()) {
    process.stdout.write(token) // write each token directly to output
    buffer.push(token)
  }

  // Option 2: process streamed output using a callback
  await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))
} catch (error) {
  console.error('Inference failed:', error)
}
```

### 8. Release Resources
Unload the model when finished:

```js
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```