# @qvac/llm-llamacpp

LLM inference for text generation and chat, with support for images and other media within a single conversation context.

## Overview

Bare module that adds support for text completion and multimodal prompts in QVAC, using llama.cpp as the inference engine.
## Models

You can load any llama.cpp-compatible text-generation/chat model. Model file format: `*.gguf`.
## Requirements

Bare v1.24
## Installation

```sh
npm i @qvac/llm-llamacpp
```

## Quickstart
If you don't have the Bare runtime, install it:

```sh
npm i -g bare
```

Create a new project:
```sh
mkdir qvac-llm-quickstart
cd qvac-llm-quickstart
npm init -y
```

Install dependencies:

```sh
npm i @qvac/dl-filesystem @qvac/llm-llamacpp bare-process
```

Download a compatible model:
```sh
curl -L --create-dirs -o models/Llama-3.2-1B-Instruct-Q4_0.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
```

Create `index.js`:
```js
'use strict'

const LlmLlamacpp = require('@qvac/llm-llamacpp')
const FilesystemDL = require('@qvac/dl-filesystem')
const process = require('bare-process')

async function main () {
  const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
  const dirPath = './models'

  // 1. Initialize the data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configure model settings
  const args = {
    loader: fsDL,
    opts: { stats: true },
    logger: console,
    diskPath: dirPath,
    modelName
  }
  const config = {
    device: 'gpu',
    gpu_layers: '999',
    ctx_size: '1024'
  }

  // 3. Load the model
  const model = new LlmLlamacpp(args, config)
  await model.load()

  try {
    // 4. Run inference with a conversation prompt
    const prompt = [
      {
        role: 'system',
        content: 'You are a helpful, respectful and honest assistant.'
      },
      {
        role: 'user',
        content: 'What is bitcoin?'
      },
      {
        role: 'assistant',
        content: "It's a digital currency."
      },
      {
        role: 'user',
        content: 'Can you elaborate on the previous topic?'
      }
    ]
    const response = await model.run(prompt)

    let fullResponse = ''
    await response
      .onUpdate(data => {
        process.stdout.write(data)
        fullResponse += data
      })
      .await()

    console.log('\n')
    console.log('Full response:\n', fullResponse)
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Clean up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(error => {
  console.error('Fatal error in main function:', {
    error: error.message,
    stack: error.stack,
    timestamp: new Date().toISOString()
  })
  process.exit(1)
})
```

Run `index.js`:

```sh
bare index.js
```

## Usage
### 1. Import the Model Class

```js
const LlmLlamacpp = require('@qvac/llm-llamacpp')
```

### 2. Create a Data Loader
Data Loaders abstract the way model files are accessed. Use a FileSystemDataLoader to load model files from your local file system. Models can be downloaded directly from HuggingFace.
```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the `args` Object
```js
const args = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName,
  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support, pass the projection model name
}
```

The `args` object contains the following properties:
- `loader`: The Data Loader instance from which the model file will be streamed.
- `logger`: Used to create a `QvacLogger` instance, which handles all logging functionality.
- `opts.stats`: Flag that determines whether to calculate inference stats.
- `diskPath`: The local directory the model file will be downloaded to.
- `modelName`: The name of the model file in the Data Loader.
- `projectionModel`: The name of the projection model file in the Data Loader. This is required for multimodal support.
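For a multimodal model, the `args` object differs only in the extra `projectionModel` entry. A minimal sketch — the loader is stubbed out here for illustration (in real code it is the `FilesystemDL` instance from step 2), and the model file name is illustrative; only the projection model name comes from the commented example above:

```js
// Sketch only: fsDL stands in for the FilesystemDL instance from step 2.
const fsDL = null // placeholder for: new FilesystemDL({ dirPath: './models' })

const multimodalArgs = {
  loader: fsDL,
  opts: { stats: true }, // collect inference stats
  logger: console,
  diskPath: './models',
  modelName: 'SmolVLM2-500M-Video-Instruct-Q8_0.gguf', // illustrative file name
  projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // required for multimodal
}

console.log('keys:', Object.keys(multimodalArgs).join(','))
```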
### 4. Create the `config` Object

The `config` object consists of a set of hyper-parameters that can be used to tweak the behaviour of the model.
All parameter values must be strings.
```js
// an example of a possible configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to the GPU
  ctx_size: '1024', // context length
  device: 'cpu' // required: 'gpu' or 'cpu'; omitting it throws an error
}
```

| Parameter | Range / Type | Default | Description |
|---|---|---|---|
| device | "gpu" or "cpu" | — (required) | Device to run inference on |
| gpu_layers | integer | 0 | Number of model layers to offload to GPU |
| ctx_size | 0 – model-dependent | 4096 (0 = loaded from model) | Context window size |
| lora | string | — | Path to LoRA adapter file |
| temp | 0.00 – 2.00 | 0.8 | Sampling temperature |
| top_p | 0 – 1 | 0.9 | Top-p (nucleus) sampling |
| top_k | 0 – 128 | 40 | Top-k sampling |
| predict | integer (-1 = infinity) | -1 | Maximum tokens to predict |
| seed | integer | -1 (random) | Random seed for sampling |
| no_mmap | "" (passing empty string sets the flag) | — | Disable memory mapping for model loading |
| reverse_prompt | string (comma-separated) | — | Stop generation when these strings are encountered |
| repeat_penalty | float | 1.1 | Repetition penalty |
| presence_penalty | float | 0 | Presence penalty for sampling |
| frequency_penalty | float | 0 | Frequency penalty for sampling |
| tools | "true" or "false" | "false" | Enable tool calling with jinja templating |
| verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
| n_discarded | integer | 0 | Tokens to discard in sliding window context |
| main-gpu | integer, "integrated", or "dedicated" | — | GPU selection for multi-GPU systems |
iGPU/GPU selection logic:
| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
|---|---|---|---|
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |
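Putting the two tables together, a config that pins inference to a dedicated GPU might look like the sketch below. Every value is a string, as required above; the specific numbers are illustrative, not recommendations:

```js
// All config values must be strings.
const config = {
  device: 'gpu',           // required: 'gpu' or 'cpu'
  'main-gpu': 'dedicated', // consider only dedicated GPUs (falls back to CPU if none)
  gpu_layers: '99',        // offload up to 99 layers to the GPU
  ctx_size: '2048',        // context window size
  temp: '0.7',             // sampling temperature
  seed: '42'               // fixed seed for reproducible sampling
}

// Sanity check: the format requires every value to be a string.
const allStrings = Object.values(config).every(v => typeof v === 'string')
console.log('all values are strings:', allStrings)
```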
### 5. Create a Model Instance

```js
const model = new LlmLlamacpp(args, config)
```

### 6. Load the Model

```js
await model.load()
```

Optionally, you can pass the following parameters to tweak the loading behaviour:
- `close?`: Boolean value that determines whether to close the Data Loader after loading. Defaults to `true`.
- `reportProgressCallback?`: A callback function that gets called periodically with progress updates. It can be used to display the overall progress percentage.
For example:

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

#### Progress Callback Data
The progress callback receives an object with the following properties:
| Property | Type | Description |
|---|---|---|
| `action` | string | Current operation being performed |
| `totalSize` | number | Total bytes to be loaded |
| `totalFiles` | number | Total number of files to process |
| `filesProcessed` | number | Number of files completed so far |
| `currentFile` | string | Name of the file currently being processed |
| `currentFileProgress` | string | Percentage progress on the current file |
| `overallProgress` | string | Overall loading progress percentage |
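The callback itself is just a function of that object. A small sketch of a reporter — the sample values below are made up purely to show the shapes (strings for percentages, numbers for sizes and counts, per the table above):

```js
// Format one progress update into a single status line.
function formatProgress (p) {
  return `${p.action}: file ${p.filesProcessed + 1}/${p.totalFiles} ` +
    `(${p.currentFile} ${p.currentFileProgress}%), overall ${p.overallProgress}%`
}

// Illustrative sample update, shaped like the table above.
const sample = {
  action: 'downloading',
  totalSize: 773000000,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'Llama-3.2-1B-Instruct-Q4_0.gguf',
  currentFileProgress: '42',
  overallProgress: '42'
}

console.log(formatProgress(sample))
```

In real code you would pass something like `progress => process.stdout.write('\r' + formatProgress(progress))` as the second argument to `model.load`.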
### 7. Run Inference

Pass an array of messages (following the chat-completion format) to the `run` method, then process the generated tokens asynchronously:
```js
try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]
  const response = await model.run(messages)
  const buffer = []

  // Option 1: process streamed output using an async iterator
  for await (const token of response.iterate()) {
    process.stdout.write(token) // write each token directly to output
    buffer.push(token)
  }

  // Option 2: process streamed output using a callback
  await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))
} catch (error) {
  console.error('Inference failed:', error)
}
```

### 8. Release Resources
Unload the model when finished:

```js
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```