# @qvac/embed-llamacpp
Vector embedding generation for semantic search, clustering, and retrieval, with seamless support for retrieval-augmented generation (RAG) workflows.
## Overview
A Bare module that adds support for text embeddings and RAG in QVAC, using llama.cpp as the inference engine.
## Models
You can load any llama.cpp-compatible embedding model. Model file format: `*.gguf`.
## Requirements
Bare v1.24
## Installation
```sh
npm i @qvac/embed-llamacpp
```

## Quickstart
If you don't have the Bare runtime, install it:

```sh
npm i -g bare
```

Create a new project:
```sh
mkdir qvac-embed-quickstart
cd qvac-embed-quickstart
npm init -y
```

Install dependencies:
```sh
npm i @qvac/dl-filesystem @qvac/embed-llamacpp
```

Download a compatible model:
```sh
curl -L --create-dirs -o models/gte-large_fp16.gguf \
  https://huggingface.co/ChristianAzinn/gte-large-gguf/resolve/main/gte-large_fp16.gguf
```

Create `index.js`:
```js
'use strict'

const FilesystemDL = require('@qvac/dl-filesystem')
const GGMLBert = require('@qvac/embed-llamacpp')

async function main () {
  const modelName = 'gte-large_fp16.gguf'
  const dirPath = './models'

  // 1. Initialize the data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configure model settings
  const args = {
    loader: fsDL,
    logger: console,
    opts: { stats: true },
    diskPath: dirPath,
    modelName
  }
  const config = '-ngl\t25'

  // 3. Load the model
  const model = new GGMLBert(args, config)
  await model.load()

  try {
    // 4. Generate embeddings
    const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
    const response = await model.run(query)
    const embeddings = await response.await()

    console.log('Embeddings shape:', embeddings.length, 'x', embeddings[0].length)
    console.log('First few values of first embedding:')
    console.log(embeddings[0].slice(0, 5))
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Clean up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(console.error)
```

Run `index.js`:

```sh
bare index.js
```

## Usage
### 1. Import the Model Class

```js
const GGMLBert = require('@qvac/embed-llamacpp')
```

### 2. Create a Data Loader
Data Loaders abstract the way model files are accessed. Use the `@qvac/dl-filesystem` Data Loader to load model files from your local file system. Models can be downloaded directly from Hugging Face.
```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'gte-large_fp16.gguf'
const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the args object
```js
const args = {
  loader: fsDL,
  logger: console,
  opts: { stats: true },
  diskPath: dirPath,
  modelName
}
```

The `args` object contains the following properties:

- `loader`: The Data Loader instance from which the model file will be streamed.
- `logger`: Used to create a `QvacLogger` instance, which handles all logging functionality.
- `opts.stats`: This flag determines whether to calculate inference stats.
- `diskPath`: The local directory the model file will be downloaded to.
- `modelName`: The name of the model file in the Data Loader.
### 4. Create the config
The config is a string of hyper-parameters that can be used to tweak the behaviour of the model.
Each parameter is separated from its value by a tab (`\t`), and parameter/value pairs are separated by newlines (`\n`).
```js
// an example of a possible configuration
const config = '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```

| Parameter | Range / Type | Default | Description |
|---|---|---|---|
| -dev | "gpu" or "cpu" | "gpu" | Device to run inference on |
| -ngl | integer | 0 | Number of model layers to offload to GPU |
| --batch-size | integer | 2048 | Maximum number of tokens processed together in a batch |
| --pooling | {none,mean,cls,last,rank} | model default | Pooling type for embeddings |
| --attention | {causal,non-causal} | model default | Attention type for embeddings |
| --embd-normalize | integer | 2 | Embedding normalization (-1=none, 0=max abs int16, 1=taxicab, 2=euclidean, >2=p-norm) |
| -fa | "on", "off", or "auto" | "auto" | Enable/disable flash attention |
| --main-gpu | integer, "integrated", or "dedicated" | — | GPU selection for multi-GPU systems |
| verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
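The tab/newline format above can be tedious to write by hand. As a sketch, a small helper (hypothetical, not part of the module) can build the config string from a plain object:

```javascript
// Hypothetical helper (not part of @qvac/embed-llamacpp): builds the
// tab-separated, newline-delimited config string from a plain object.
function buildConfig (params) {
  return Object.entries(params)
    .map(([flag, value]) => `${flag}\t${value}`)
    .join('\n')
}

const config = buildConfig({
  '-ngl': 99,
  '--batch-size': 1024,
  '-dev': 'gpu'
})
// config is now '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```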
iGPU/GPU selection logic:
| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
|---|---|---|---|
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |
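For example, to pin inference to an integrated GPU, per the table above (an illustrative config, using only the parameters documented here):

```javascript
// Run on the integrated GPU and offload up to 99 layers to it
const config = '-dev\tgpu\n--main-gpu\tintegrated\n-ngl\t99'
```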
### 5. Instantiate the model
```js
const model = new GGMLBert(args, config)
```

### 6. Load the model

```js
await model.load()
```

Optionally, you can pass the following parameters to tweak the loading behaviour:
- `close?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`.
- `reportProgressCallback?`: A callback function that is called periodically with progress updates. It can be used to display an overall progress percentage.
For example:

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

#### Progress Callback Data
The progress callback receives an object with the following properties:
| Property | Type | Description |
|---|---|---|
| action | string | Current operation being performed |
| totalSize | number | Total bytes to be loaded |
| totalFiles | number | Total number of files to process |
| filesProcessed | number | Number of files completed so far |
| currentFile | string | Name of the file currently being processed |
| currentFileProgress | string | Percentage progress on the current file |
| overallProgress | string | Overall loading progress percentage |
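The fields above can be folded into a one-line status string for display; the formatter below is illustrative, not part of the module:

```javascript
// Illustrative formatter for the progress object documented above
function formatProgress (p) {
  return `${p.action}: ${p.currentFile} ` +
    `(${p.filesProcessed}/${p.totalFiles} files, ${p.overallProgress}% overall)`
}

// Example usage with model.load:
// await model.load(true, p => process.stdout.write('\r' + formatProgress(p)))
```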
### 7. Generate embeddings for the input sequence
The model outputs a vector for the input sequence.
```js
const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
const response = await model.run(query)
const embeddings = await response.await()
```

### 8. Release Resources
Unload the model when finished:
```js
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```
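The vectors returned by `run` are typically compared with cosine similarity for semantic search. With the default `--embd-normalize` of `2` (euclidean), embeddings are already unit-length, so the cosine score reduces to a dot product. A minimal, model-free sketch:

```javascript
// Cosine similarity between two vectors; for unit-length vectors
// (the default --embd-normalize 2) this reduces to the dot product
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank candidate embeddings against a query embedding, best match first
function rank (queryVec, candidates) {
  return candidates
    .map((vec, i) => ({ i, score: cosineSimilarity(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
}
```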