Completion
LLM inference for text generation and chat — i.e., use a large language model to generate text output based on input prompts.
Overview
Completion uses llama.cpp as its inference engine. Load any supported model using modelType: "llm". Then, provide an array history as input, where each element is an object with the following properties:
- role: string; either "user" or "assistant"
- content: string

role: "user" indicates that content is a previous prompt. role: "assistant" indicates that content is a previous inference (LLM output).
The completion output is generated from the full sequence of messages provided in history.
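For example, a history with one completed turn plus a new prompt looks like this (a minimal sketch of the message shape described above; the content strings are placeholders):

```javascript
// One completed turn (user prompt + assistant reply) plus a new user prompt.
// Each element follows the { role, content } shape described above.
const history = [
  { role: "user", content: "Name one planet." },
  { role: "assistant", content: "Mars." },
  { role: "user", content: "Name another one." },
];

// The last "user" message is the new prompt; earlier messages supply context.
console.log(history.length); // 3
```

The model sees the whole array on every call, so appending each new prompt and each generated reply to history is what carries the conversation forward.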
Functions
Use the following sequence of function calls: downloadAsset (optional, pre-caches the model file), then loadModel, then completion, and finally unloadModel.
For how to use each function, see SDK — API reference.
Models
You can load any llama.cpp-compatible text-generation/chat model. Model file format: *.gguf.
- If the model is sharded across multiple files (a multi-file bundle), see Sharded models.
- For multimodal prompts (images + text), see Multimodal.
- For models available as constants, see SDK — Models.
Features
- Tool calls: let the model emit structured tool calls and stream tool-call events alongside tokens.
- MCP: plug MCP servers into completion() so the model can use external tools (e.g. web search) via the same tool-call mechanism.
- KV cache: cache and reuse the model’s key/value attention state to speed up follow-up turns in long conversations.
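All three features are opt-in options on the completion() call. A minimal sketch of the combined options shape, using the parameter names that appear in the examples below (the placeholder values here are illustrative, not real IDs or definitions):

```javascript
// Hypothetical options object combining the opt-in features; the exact
// types are defined by the SDK. All three feature options are independent.
const completionOptions = {
  modelId: "model-id-from-loadModel", // placeholder; returned by loadModel()
  history: [{ role: "user", content: "Hello" }],
  stream: true,  // consume result.tokenStream token by token
  tools: [],     // tool definitions (see "Tool call" below)
  mcp: [],       // MCP client bindings (see "MCP" below)
  kvCache: true, // reuse key/value attention state (see "KV cache" below)
};
```

In practice you would pass such an object to completion() and consume the returned streams, as the full examples below demonstrate.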
Examples
Usage
The following script shows a basic example of completion:
import { completion, LLAMA_3_2_1B_INST_Q4_0, loadModel, downloadAsset, unloadModel, VERBOSITY, } from "@qvac/sdk";
try {
// First just cache the model
await downloadAsset({
assetSrc: LLAMA_3_2_1B_INST_Q4_0,
onProgress: (progress) => {
console.log(progress);
},
});
// Then load it in memory from cache
const modelId = await loadModel({
modelSrc: LLAMA_3_2_1B_INST_Q4_0,
modelType: "llm",
modelConfig: {
device: "gpu",
ctx_size: 2048,
verbosity: VERBOSITY.ERROR,
},
});
const history = [
{
role: "user",
content: "Explain quantum computing in one sentence, use lots of emojis",
},
];
const result = completion({ modelId, history, stream: true });
for await (const token of result.tokenStream) {
process.stdout.write(token);
}
const stats = await result.stats;
console.log("\n📊 Performance Stats:", stats);
// Change `clearStorage: true` to delete cached model files
await unloadModel({ modelId, clearStorage: false });
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}

Tool call
The following script shows how to provide tool definitions to completion(), stream toolCallStream events, and read the parsed tool calls:
import { z } from "zod";
import { completion, loadModel, unloadModel, QWEN3_1_7B_INST_Q4, } from "../index.js";
// Define Zod schemas for tool parameters
const weatherSchema = z.object({
city: z.string().describe("City name"),
country: z.string().describe("Country code").optional(),
});
const horoscopeSchema = z.object({
sign: z.string().describe("An astrological sign like Taurus or Aquarius"),
});
// Map tool names to their schemas for runtime validation
const toolSchemas = {
get_weather: weatherSchema,
get_horoscope: horoscopeSchema,
};
// Simple tool definitions - just name, description, and Zod schema!
const tools = [
{
name: "get_weather",
description: "Get current weather for a city",
parameters: weatherSchema,
},
{
name: "get_horoscope",
description: "Get today's horoscope for an astrological sign",
parameters: horoscopeSchema,
},
];
try {
// Load model from provided file path with tools support enabled
const modelId = await loadModel({
modelSrc: QWEN3_1_7B_INST_Q4,
modelType: "llm",
modelConfig: {
ctx_size: 4096,
tools: true, // Enable tools support
},
onProgress: (progress) => console.log(`Loading: ${progress.percentage.toFixed(1)}%`),
});
console.log(`✅ Model loaded successfully! Model ID: ${modelId}`);
// Create conversation history
const history = [
{
role: "system",
content: "You are a helpful assistant that can use tools to get the weather and horoscope.",
},
{
role: "user",
content: "What's the weather in Tokyo and my horoscope for Aquarius?",
},
];
console.log("\n🤖 AI Response:");
console.log("(Streaming with tool definitions in prompt)\n");
const result = completion({ modelId, history, stream: true, tools });
// Consume token stream
const tokensTask = (async () => {
for await (const token of result.tokenStream) {
process.stdout.write(token);
}
})();
// Consume tool call events
const toolsTask = (async () => {
for await (const evt of result.toolCallStream) {
if (evt.type === "toolCall") {
console.log(`\n\n→ Tool Call Detected: ${evt.call.name}(${JSON.stringify(evt.call.arguments)})`);
console.log(` ID: ${evt.call.id}`);
}
else if (evt.type === "toolCallError") {
console.warn(`\n⚠️ Tool Error: ${evt.error.message}`);
console.warn(` Code: ${evt.error.code}`);
}
}
})();
await Promise.all([tokensTask, toolsTask]);
const stats = await result.stats;
const toolCalls = await result.toolCalls;
console.log("\n\n📋 Parsed Tool Calls:");
if (toolCalls.length > 0) {
for (const call of toolCalls) {
console.log(` - ${call.name}(${JSON.stringify(call.arguments)})`);
const schema = toolSchemas[call.name];
if (schema) {
const validated = schema.safeParse(call.arguments);
if (validated.success) {
console.log(` ✓ Arguments validated with Zod`);
}
else {
console.log(` ✗ Validation failed:`, validated.error);
}
}
}
}
else {
console.log(" No tool calls detected in response");
}
console.log("\n📊 Performance Stats:", stats);
// Execute tool calls and send results back to the model
if (toolCalls.length > 0) {
console.log("\n\n🔧 Simulating Tool Execution...");
// Simulate tool execution (in a real app, you'd call actual APIs)
const toolResults = toolCalls.map((call) => {
let result = "";
if (call.name === "get_weather") {
const args = call.arguments;
result = `The weather in ${args.city} is sunny, 22°C with light clouds.`;
}
else if (call.name === "get_horoscope") {
const args = call.arguments;
result = `Horoscope for ${args.sign}: Today is a great day for new beginnings and creative endeavors!`;
}
console.log(` ✓ ${call.name}: ${result}`);
return { toolCallId: call.id, result };
});
// Add tool results to conversation history
history.push({
role: "assistant",
content: await result.text,
});
// Add tool results as tool messages
for (const toolResult of toolResults) {
history.push({
role: "tool",
content: toolResult.result,
});
}
// Send follow-up question with tool results
console.log("\n\n🤖 Follow-up Response with Tool Results:");
const followUpResult = completion({
modelId,
history,
stream: true,
tools,
});
for await (const token of followUpResult.tokenStream) {
process.stdout.write(token);
}
const followUpStats = await followUpResult.stats;
console.log("\n\n📊 Follow-up Stats:", followUpStats);
}
console.log("\n\n🎉 Completed!");
await unloadModel({ modelId, clearStorage: false });
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}

MCP
You create and manage the MCP client, connect it to one or more MCP servers, and pass it to completion(). The following script shows how to attach an MCP client to completion() so the model can call a web search tool and then continue with the results:
/**
* MCP DuckDuckGo Search Example
*
* A web search example using DuckDuckGo - no API key required!
* The server provides tools to search the web and get answers.
*
* Prerequisites:
* - Install MCP SDK: bun add @modelcontextprotocol/sdk
*
* Run with: bun run examples/mcp-websearch.ts
*/
import { completion, loadModel, unloadModel, QWEN3_1_7B_INST_Q4, } from "../index.js";
// MCP SDK is a user-installed optional dependency
// Install with: bun add @modelcontextprotocol/sdk
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
function parseSearchResults(mcpResult) {
try {
const result = mcpResult;
// Extract text content from MCP response
const textContent = result.content?.find((c) => c.type === "text");
if (!textContent?.text) {
return JSON.stringify(mcpResult);
}
// Parse the JSON array of search results
const rawResults = JSON.parse(textContent.text);
// Extract just the useful fields (title, url, snippet)
const cleanResults = rawResults.slice(0, 5).map((r) => ({
title: r.title ?? "Unknown",
url: r.url ?? "",
snippet: r.snippet ?? "",
}));
// Format as concise text for LLM
return cleanResults
.map((r, i) => `[${i + 1}] ${r.title}\n URL: ${r.url}\n ${r.snippet}`)
.join("\n\n");
}
catch {
// If parsing fails, return a truncated version
const str = typeof mcpResult === "string" ? mcpResult : JSON.stringify(mcpResult);
return str.slice(0, 2000);
}
}
let mcpClient = null;
try {
console.log("🦆 MCP DuckDuckGo Search Example\n");
// ============================================================
// STEP 1: Connect to DuckDuckGo MCP server
// ============================================================
console.log("1️⃣ Starting DuckDuckGo MCP server...");
mcpClient = new Client({
name: "qvac-ddg-example",
version: "1.0.0",
});
const transport = new StdioClientTransport({
command: "npx",
args: ["-y", "@oevortex/ddg_search"],
});
await mcpClient.connect(transport);
console.log(" ✓ MCP server connected\n");
// ============================================================
// STEP 2: Load model
// ============================================================
console.log("2️⃣ Loading model...");
const modelId = await loadModel({
modelSrc: QWEN3_1_7B_INST_Q4,
modelType: "llm",
modelConfig: {
ctx_size: 4096,
tools: true,
},
onProgress: (progress) => process.stdout.write(`\r Loading: ${progress.percentage.toFixed(1)}%`),
});
console.log(`\n ✓ Model loaded\n`);
// ============================================================
// STEP 3: Ask AI to search the web (with MCP client)
// ============================================================
const history = [
{
role: "system",
content: `You are a helpful assistant with access to web search.
Use the search tool when you need current information.
Always cite your sources with the URL.`,
},
{
role: "user",
content: "What is the current weather in New York City?",
},
];
console.log("3️⃣ Asking AI to search the web...\n");
console.log("🤖 AI Response:");
// Pass MCP client directly to completion - tools are adapted internally!
const result = completion({
modelId,
history,
stream: true,
mcp: [{ client: mcpClient, includeResources: false }],
});
for await (const token of result.tokenStream) {
process.stdout.write(token);
}
const toolCalls = await result.toolCalls;
console.log("\n");
// ============================================================
// STEP 4: Execute tool calls using call() - automatic MCP routing!
// ============================================================
if (toolCalls.length > 0) {
console.log("4️⃣ Executing search...\n");
const toolResults = [];
for (const toolCall of toolCalls) {
console.log(`🔍 ${toolCall.name}(${JSON.stringify(toolCall.arguments)})`);
if (!toolCall.invoke) {
console.log(` ⚠️ No handler found for tool "${toolCall.name}"`);
continue;
}
// Use invoke() - automatically routes to the correct MCP client!
const mcpResult = await toolCall.invoke();
// Parse and clean up the search results
const cleanResult = parseSearchResults(mcpResult);
console.log(` ✓ Got search results:`);
console.log(cleanResult
.split("\n")
.map((l) => ` ${l}`)
.join("\n"));
console.log();
toolResults.push({ id: toolCall.id, result: cleanResult });
}
// ============================================================
// STEP 5: Continue with search results
// ============================================================
console.log("5️⃣ Getting AI response with search results...\n");
history.push({
role: "assistant",
content: await result.text,
});
for (const tr of toolResults) {
history.push({
role: "tool",
content: tr.result,
});
}
console.log("🤖 Final Response:");
const finalResult = completion({
modelId,
history,
stream: true,
mcp: [{ client: mcpClient, includeResources: false }],
});
for await (const token of finalResult.tokenStream) {
process.stdout.write(token);
}
console.log("\n");
}
// ============================================================
// Cleanup
// ============================================================
console.log("6️⃣ Cleaning up...");
await unloadModel({ modelId, clearStorage: false });
console.log(" ✓ Done\n");
console.log("🎉 Example completed!");
process.exit(0);
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}
finally {
if (mcpClient) {
try {
await mcpClient.close();
}
catch {
// Ignore close errors
}
}
}

KV cache
The following script enables kvCache: true to speed up follow-up turns, and then compares it with kvCache: false on the same history:
import { completion, LLAMA_3_2_1B_INST_Q4_0, loadModel, unloadModel, VERBOSITY, } from "@qvac/sdk";
try {
// Load the model
const modelId = await loadModel({
modelSrc: LLAMA_3_2_1B_INST_Q4_0,
modelType: "llm",
modelConfig: {
device: "gpu",
ctx_size: 2048,
verbosity: VERBOSITY.ERROR,
},
});
console.log("🧠 Testing KV Cache functionality...\n");
// First conversation with cache enabled
console.log("📝 First conversation (building cache):");
const history1 = [
{ role: "user", content: "What is the capital of France?" },
];
const result1 = completion({
modelId,
history: history1,
stream: true,
kvCache: true,
}); // kvCache = true
let response1 = "";
for await (const token of result1.tokenStream) {
response1 += token;
process.stdout.write(token);
}
const stats1 = await result1.stats;
console.log(`\n⏱️ First completion stats: ${JSON.stringify(stats1)}\n`);
// Continue conversation (should reuse cache from previous conversation)
console.log("🔄 Continuing conversation (reusing cache):");
const history2 = [
{ role: "user", content: "What is the capital of France?" },
{ role: "assistant", content: response1.trim() },
{ role: "user", content: "What about Germany?" },
];
// This should:
// 1. Find existing cache from [user: "What is the capital of France?"] (history minus last message)
// 2. Load that cache and process the new "What about Germany?" message
// 3. Save the updated cache and rename it to include all messages
const result2 = completion({
modelId,
history: history2,
stream: true,
kvCache: true,
}); // kvCache = true
for await (const token of result2.tokenStream) {
process.stdout.write(token);
}
const stats2 = await result2.stats;
console.log(`\n⏱️ Second completion stats: ${JSON.stringify(stats2)}\n`);
// Compare with non-cached version
console.log("🚀 Same conversation without cache:");
const result3 = completion({
modelId,
history: history2,
stream: true,
kvCache: false,
}); // kvCache = false
for await (const token of result3.tokenStream) {
process.stdout.write(token);
}
const stats3 = await result3.stats;
console.log(`\n⏱️ Non-cached completion stats: ${JSON.stringify(stats3)}\n`);
console.log("✅ KV Cache test completed!");
await unloadModel({ modelId, clearStorage: false });
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.