
@qvac/embed-llamacpp

Vector embedding generation for semantic search, clustering, and retrieval, with seamless support for retrieval-augmented generation (RAG) workflows.

Overview

A Bare module that adds text embedding and RAG support to QVAC, using llama.cpp as the inference engine.

Models

You can load any llama.cpp-compatible embeddings model. Model file format: *.gguf.

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/embed-llamacpp

Quickstart

If you don't have the Bare runtime, install it globally:

npm i -g bare

Create a new project:

mkdir qvac-embed-quickstart
cd qvac-embed-quickstart
npm init -y

Install dependencies:

npm i @qvac/dl-filesystem @qvac/embed-llamacpp

Download a compatible model:

curl -L --create-dirs -o models/gte-large_fp16.gguf \
  https://huggingface.co/ChristianAzinn/gte-large-gguf/resolve/main/gte-large_fp16.gguf

Create index.js:

'use strict'

const FilesystemDL = require('@qvac/dl-filesystem')
const GGMLBert = require('@qvac/embed-llamacpp')

async function main () {
  const modelName = 'gte-large_fp16.gguf'
  const dirPath = './models'

  // 1. Initializing data loader
  const fsDL = new FilesystemDL({ dirPath })

  // 2. Configuring model settings
  const args = {
    loader: fsDL,
    logger: console,
    opts: { stats: true },
    diskPath: dirPath,
    modelName
  }
  const config = '-ngl\t25'

  // 3. Loading model
  const model = new GGMLBert(args, config)
  await model.load()

  try {
    // 4. Generating embeddings
    const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
    const response = await model.run(query)
    const embeddings = await response.await()

    console.log('Embeddings shape:', embeddings.length, 'x', embeddings[0].length)
    console.log('First few values of first embedding:')
    console.log(embeddings[0].slice(0, 5))
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 5. Cleaning up resources
    await model.unload()
    await fsDL.close()
  }
}

main().catch(console.error)

Run index.js:

bare index.js

Usage

1. Import the Model Class

const GGMLBert = require('@qvac/embed-llamacpp')

2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use a FileSystemDataLoader to load model files from your local file system. Models can be downloaded directly from HuggingFace.

const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'gte-large_fp16.gguf'

const fsDL = new FilesystemDL({ dirPath })

3. Create the args object

const args = {
  loader: fsDL,
  logger: console,
  opts: { stats: true },
  diskPath: dirPath,
  modelName
}

The args object contains the following properties:

  • loader: The Data Loader instance from which the model file will be streamed.
  • logger: Used to create a QvacLogger instance, which handles all logging functionality.
  • opts.stats: Determines whether to calculate inference stats.
  • diskPath: The local directory where the model file will be downloaded.
  • modelName: The name of the model file in the Data Loader.

4. Create config

The config is a string consisting of a set of hyper-parameters which can be used to tweak the behaviour of the model.
Each parameter is separated by a tab (\t) from its value, and different parameters are separated by newlines (\n).

// an example of possible configuration
const config = '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
| Parameter | Range / Type | Default | Description |
| --- | --- | --- | --- |
| `-dev` | `"gpu"` or `"cpu"` | `"gpu"` | Device to run inference on |
| `-ngl` | integer | 0 | Number of model layers to offload to GPU |
| `--batch-size` | integer | 2048 | Tokens for processing multiple prompts together |
| `--pooling` | `{none,mean,cls,last,rank}` | model default | Pooling type for embeddings |
| `--attention` | `{causal,non-causal}` | model default | Attention type for embeddings |
| `--embd-normalize` | integer | 2 | Embedding normalization (-1=none, 0=max abs int16, 1=taxicab, 2=euclidean, >2=p-norm) |
| `-fa` | `"on"`, `"off"`, or `"auto"` | `"auto"` | Enable/disable flash attention |
| `--main-gpu` | integer, `"integrated"`, or `"dedicated"` | | GPU selection for multi-GPU systems |
| `verbosity` | 0–3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
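As an illustration of what the `--embd-normalize` values mean (this helper is not part of the package), the flag selects the norm used to scale the output vector:

```javascript
// Scale a vector to unit length under the p-norm:
// p=1 is taxicab (sum of absolute values), p=2 is euclidean.
function pNormalize (vec, p) {
  const norm = Math.pow(
    vec.reduce((sum, x) => sum + Math.abs(x) ** p, 0),
    1 / p
  )
  return vec.map(x => x / norm)
}

console.log(pNormalize([3, 4], 2)) // euclidean: [0.6, 0.8]
console.log(pNormalize([3, 4], 1)) // taxicab: [3/7, 4/7]
```

With the default (`2`), every embedding the model returns has euclidean length 1.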

iGPU/GPU selection logic:

| Scenario | main-gpu not specified | main-gpu: "dedicated" | main-gpu: "integrated" |
| --- | --- | --- | --- |
| Devices considered | All GPUs (dedicated + integrated) | Only dedicated GPUs | Only integrated GPUs |
| System with iGPU only | ✅ Uses iGPU | ❌ Falls back to CPU | ✅ Uses iGPU |
| System with dedicated GPU only | ✅ Uses dedicated GPU | ✅ Uses dedicated GPU | ❌ Falls back to CPU |
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |
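Since the config is just a tab/newline-encoded string, it can also be assembled from a plain object. A small sketch (this helper is not part of the package):

```javascript
// Hypothetical helper: encode a parameter object into the
// tab-separated, newline-delimited config string the model expects.
function buildConfig (params) {
  return Object.entries(params)
    .map(([key, value]) => `${key}\t${value}`)
    .join('\n')
}

const config = buildConfig({
  '-ngl': 99,
  '--batch-size': 1024,
  '-dev': 'gpu'
})
// config === '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```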

5. Instantiate the model

const model = new GGMLBert(args, config)

6. Load the model

await model.load()

Optionally you can pass the following parameters to tweak the loading behaviour.

  • close?: A boolean that determines whether to close the Data Loader after loading. Defaults to true.
  • reportProgressCallback?: A callback invoked periodically with progress updates; useful for displaying an overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

| Property | Type | Description |
| --- | --- | --- |
| action | string | Current operation being performed |
| totalSize | number | Total bytes to be loaded |
| totalFiles | number | Total number of files to process |
| filesProcessed | number | Number of files completed so far |
| currentFile | string | Name of file currently being processed |
| currentFileProgress | string | Percentage progress on current file |
| overallProgress | string | Overall loading progress percentage |
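A callback shaped after the table above could render a single status line like this (the field names follow the table; the formatting itself is only an illustration):

```javascript
// Format one line of status from a progress object as described above.
function formatProgress (progress) {
  return `${progress.action}: ${progress.filesProcessed}/${progress.totalFiles} files, ` +
    `current ${progress.currentFile} (${progress.currentFileProgress}%), ` +
    `overall ${progress.overallProgress}%`
}

// Usage with model.load:
// await model.load(true, p => process.stdout.write('\r' + formatProgress(p)))

console.log(formatProgress({
  action: 'downloading',
  totalSize: 1048576,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'gte-large_fp16.gguf',
  currentFileProgress: '42',
  overallProgress: '42'
}))
```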

7. Generate embeddings for input sequence

The model outputs a vector for the input sequence.

const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
const response = await model.run(query)
const embeddings = await response.await()
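The returned embeddings are plain numeric arrays, so they can be compared with ordinary vector math. For semantic search, a cosine-similarity helper (independent of the package) is all that's needed to rank documents against a query:

```javascript
// Cosine similarity between two embedding vectors:
// 1 means the same direction, 0 means orthogonal (unrelated).
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Note that with the default `--embd-normalize 2` the vectors are already unit length, so the plain dot product gives the same ranking.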

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}

More resources

Package at npm
