# @qvac/embed-llamacpp
Vector embedding generation for semantic search, clustering, and retrieval, with seamless support for retrieval-augmented generation (RAG) workflows.
## Overview
A Bare module that adds support for text embeddings and RAG in QVAC, using llama.cpp as the inference engine.
## Models
You can load any llama.cpp-compatible embedding model. Model file format: `*.gguf`.
## Requirements
Bare v1.24
## Installation
```sh
npm i @qvac/embed-llamacpp
```

## Quickstart
If you don't have the Bare runtime, install it:

```sh
npm i -g bare
```

Create a new project:
```sh
mkdir qvac-embed-quickstart
cd qvac-embed-quickstart
npm init -y
```

Install dependencies:
```sh
npm i @qvac/dl-filesystem @qvac/embed-llamacpp
```

Download a compatible model:
```sh
curl -L --create-dirs -o models/gte-large_fp16.gguf \
  https://huggingface.co/ChristianAzinn/gte-large-gguf/resolve/main/gte-large_fp16.gguf
```

Create `index.js`:
```js
'use strict'

const FilesystemDL = require('@qvac/dl-filesystem')
const GGMLBert = require('@qvac/embed-llamacpp')

async function main () {
  const modelName = 'gte-large_fp16.gguf'
  const dirPath = './models'

  // 1. Initialize the data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configure model settings
  const args = {
    loader: fsDL,
    logger: console,
    opts: { stats: true },
    diskPath: dirPath,
    modelName
  }
  const config = '-ngl\t25'

  // 3. Load the model
  const model = new GGMLBert(args, config)
  await model.load()

  try {
    // 4. Generate embeddings
    const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
    const response = await model.run(query)
    const embeddings = await response.await()

    console.log('Embeddings shape:', embeddings.length, 'x', embeddings[0].length)
    console.log('First few values of first embedding:')
    console.log(embeddings[0].slice(0, 5))
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Clean up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(console.error)
```

Run `index.js`:

```sh
bare index.js
```

## Usage
### 1. Import the Model Class

```js
const GGMLBert = require('@qvac/embed-llamacpp')
```

### 2. Create a Data Loader
Data Loaders abstract the way model files are accessed. Use the `@qvac/dl-filesystem` Data Loader to load model files from your local file system. Models can be downloaded directly from Hugging Face.
```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'gte-large_fp16.gguf'
const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the args object
```js
const args = {
  loader: fsDL,
  logger: console,
  opts: { stats: true },
  diskPath: dirPath,
  modelName
}
```

The `args` object contains the following properties:

- `loader`: The Data Loader instance from which the model file will be streamed.
- `logger`: Used to create a `QvacLogger` instance, which handles all logging functionality.
- `opts.stats`: This flag determines whether to calculate inference stats.
- `diskPath`: The local directory the model file will be downloaded to.
- `modelName`: The name of the model file in the Data Loader.
### 4. Create the config
The config is a string of hyper-parameters that can be used to tweak the behaviour of the model.
Each parameter is separated from its value by a tab (`\t`), and parameter/value pairs are separated by newlines (`\n`).
```js
// an example of a possible configuration
const config = '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```

| Parameter | Range / Type | Default | Description |
|---|---|---|---|
| -dev | "gpu" or "cpu" | "gpu" | Device to run inference on |
| -ngl | integer | 0 | Number of model layers to offload to GPU |
| --batch-size | integer | 2048 | Maximum number of tokens processed together in a batch |
| --pooling | {none,mean,cls,last,rank} | model default | Pooling type for embeddings |
| --attention | {causal,non-causal} | model default | Attention type for embeddings |
| --embd-normalize | integer | 2 | Embedding normalization (-1=none, 0=max abs int16, 1=taxicab, 2=euclidean, >2=p-norm) |
| -fa | "on", "off", or "auto" | "auto" | Enable/disable flash attention |
| --main-gpu | integer, "integrated", or "dedicated" | — | GPU selection for multi-GPU systems |
| verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
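The tab/newline format above can be tedious to write by hand. As a sketch, a small helper (hypothetical, not part of the module) can build the config string from a plain object:

```javascript
// Hypothetical helper (not part of @qvac/embed-llamacpp): builds the
// tab-separated, newline-delimited config string from a plain object.
function buildConfig (params) {
  return Object.entries(params)
    .map(([flag, value]) => `${flag}\t${value}`)
    .join('\n')
}

const config = buildConfig({
  '-ngl': 99,
  '--batch-size': 1024,
  '-dev': 'gpu'
})
// config is now '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```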
iGPU/GPU selection logic:
| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
|---|---|---|---|
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |
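For example, to pin inference to an integrated GPU, per the table above (an illustrative config, using only the parameters documented here):

```javascript
// Run on the integrated GPU and offload up to 99 layers to it
const config = '-dev\tgpu\n--main-gpu\tintegrated\n-ngl\t99'
```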
### 5. Instantiate the model
```js
const model = new GGMLBert(args, config)
```

### 6. Load the model

```js
await model.load()
```

Optionally, you can pass the following parameters to tweak the loading behaviour:
- `close?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`.
- `reportProgressCallback?`: A callback function that is called periodically with progress updates. It can be used to display an overall progress percentage.
For example:

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

#### Progress Callback Data
The progress callback receives an object with the following properties:
| Property | Type | Description |
|---|---|---|
| action | string | Current operation being performed |
| totalSize | number | Total bytes to be loaded |
| totalFiles | number | Total number of files to process |
| filesProcessed | number | Number of files completed so far |
| currentFile | string | Name of the file currently being processed |
| currentFileProgress | string | Percentage progress on the current file |
| overallProgress | string | Overall loading progress percentage |
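The fields above can be folded into a one-line status string for display; the formatter below is illustrative, not part of the module:

```javascript
// Illustrative formatter for the progress object documented above
function formatProgress (p) {
  return `${p.action}: ${p.currentFile} ` +
    `(${p.filesProcessed}/${p.totalFiles} files, ${p.overallProgress}% overall)`
}

// Example usage with model.load:
// await model.load(true, p => process.stdout.write('\r' + formatProgress(p)))
```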
### 7. Generate embeddings for the input sequence
The model outputs a vector for the input sequence.
```js
const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
const response = await model.run(query)
const embeddings = await response.await()
```

### 8. Release Resources
Unload the model when finished:
```js
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```
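The vectors returned by `run` are typically compared with cosine similarity for semantic search. With the default `--embd-normalize` of `2` (euclidean), embeddings are already unit-length, so the cosine score reduces to a dot product. A minimal, model-free sketch:

```javascript
// Cosine similarity between two vectors; for unit-length vectors
// (the default --embd-normalize 2) this reduces to the dot product
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Rank candidate embeddings against a query embedding, best match first
function rank (queryVec, candidates) {
  return candidates
    .map((vec, i) => ({ i, score: cosineSimilarity(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
}
```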