Sharded models

Overview

Some models are distributed as multiple files (shards) instead of a single large file. loadModel() supports sharded models and ensures that all shards are available before loading the model. For this to work, shard file names must follow the pattern <name>-00001-of-0000X.<ext>.
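To make the naming pattern concrete, here is a minimal sketch that parses a shard file name and lists the names of every shard in the same set. The helpers parseShardName() and expectedShardNames() are hypothetical and not part of @qvac/sdk, and the five-digit zero padding is an assumption taken from the example names on this page.

```javascript
// Pattern from the text: <name>-00001-of-0000X.<ext>
// Assumption: both counters are five digits, zero-padded.
const SHARD_PATTERN = /^(.+)-(\d{5})-of-(\d{5})\.([^.]+)$/;

// Hypothetical helper: split a shard file name into its parts, or return null.
function parseShardName(fileName) {
  const match = SHARD_PATTERN.exec(fileName);
  if (!match) return null;
  const [, name, index, total, ext] = match;
  return { name, index: Number(index), total: Number(total), ext };
}

// Hypothetical helper: list every file name that belongs to the same shard set.
function expectedShardNames(fileName) {
  const parsed = parseShardName(fileName);
  if (!parsed) return [];
  const pad = (n) => String(n).padStart(5, "0");
  return Array.from(
    { length: parsed.total },
    (_, i) => `${parsed.name}-${pad(i + 1)}-of-${pad(parsed.total)}.${parsed.ext}`
  );
}

console.log(expectedShardNames("model-00002-of-00003.gguf"));
```

A name that does not match the pattern, such as model.gguf, yields null from parseShardName() and an empty list from expectedShardNames().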

Important

Sharding is currently supported only for GGUF models.

Supported formats

- Archives (.tar, .tar.gz, .tgz): pass an HTTP URL or a local path; the archive is extracted automatically.
- HTTP sharded URL: pass the download URL of any shard and the SDK fetches the remaining shards.
- Hyperdrive: use any sharded Hyperdrive model source.
- Local shards: pass the path to any shard file.

Local sharded models

All of the following files must be present in the same directory:

- All numbered shard files, for example:
  - model-00001-of-00005.gguf
  - model-00002-of-00005.gguf
  - …
  - model-00005-of-00005.gguf
- A companion tensors manifest file:
  - model.tensors.txt

When calling loadModel(), pass the path to the first shard (e.g., model-00001-of-00005.gguf); the SDK automatically detects and loads the remaining shards.
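Because every shard and the manifest must sit in the same directory, it can help to check for missing files before calling loadModel(). The sketch below is a hypothetical pre-flight check, not part of @qvac/sdk. It assumes five-digit counters and that the manifest is named <name>.tensors.txt, as in the example listing above.

```javascript
// Hypothetical pre-flight check: given one shard's file name and the names of
// the files in its directory, return the required files that are absent.
function findMissingShardFiles(shardFileName, filesInDirectory) {
  const match = /^(.+)-\d{5}-of-(\d{5})\.([^.]+)$/.exec(shardFileName);
  if (!match) throw new Error(`Not a shard file name: ${shardFileName}`);
  const [, name, total, ext] = match;

  const required = [];
  for (let i = 1; i <= Number(total); i++) {
    required.push(`${name}-${String(i).padStart(5, "0")}-of-${total}.${ext}`);
  }
  // Assumption: the manifest follows the <name>.tensors.txt naming of the example.
  required.push(`${name}.tensors.txt`);

  const present = new Set(filesInDirectory);
  return required.filter((file) => !present.has(file));
}

// Example listing with the second shard and the manifest missing.
const listing = ["model-00001-of-00003.gguf", "model-00003-of-00003.gguf"];
console.log(findMissingShardFiles("model-00001-of-00003.gguf", listing));
```

In a real script the listing could come from Node's fs.readdirSync() on the shard's directory. The function takes a plain array so that it stays easy to test.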

Functions

- loadModel(): pass the path or URL of any shard; the SDK fetches the remaining shards.
- completion(), embed(), etc.: use the model as usual.
- unloadModel(): unload the model when you are done.

For how to use each function, see SDK — API reference.

Example

The following script loads a sharded model with per-shard progress tracking:

sharded-models.js

import { completion, loadModel, unloadModel, VERBOSITY } from "@qvac/sdk";
// Sharded models can be loaded from:
// 1. HTTP archives: "https://example.com/model.tar.gz"
// 2. HTTP pattern: "https://example.com/model-00001-of-00005.gguf"
// 3. Hyperdrive: use any sharded model source/constant, e.g., LLAMA_3_2_1B_INST_Q4_0_SHARD
// 4. Local filesystem: pass the path to the first shard file (Note: All shards must be in the same directory)
// 5. Local archive: pass the path to the archive file (.tar, .tar.gz, .tgz)
try {
    const modelId = await loadModel({
        modelSrc: "https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_0-00001-of-00002.gguf",
        modelType: "llm",
        modelConfig: {
            device: "gpu",
            ctx_size: 2048,
            verbosity: VERBOSITY.ERROR,
        },
        onProgress: (progress) => {
            // For sharded models, progress.shardInfo contains detailed progress for both
            // individual shards AND overall download progress across all shards
            if (progress.shardInfo) {
                // For pattern-based or Hyperdrive shards
                const { shardInfo } = progress;
                console.log(`📥 Downloading ${shardInfo.shardName} (${shardInfo.currentShard}/${shardInfo.totalShards})\n` +
                    `   File: ${progress.percentage.toFixed(1)}% (${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)\n` +
                    `   Overall: ${shardInfo.overallPercentage.toFixed(1)}% (${(shardInfo.overallDownloaded / 1024 / 1024).toFixed(2)}MB / ${(shardInfo.overallTotal / 1024 / 1024).toFixed(2)}MB)`);
            }
            else {
                // For archive-based shards
                console.log(`📥 Progress: ${progress.percentage.toFixed(1)}% ` +
                    `(${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)`);
            }
        },
    });
    const history = [
        {
            role: "user",
            content: "What are the benefits of sharding large language models? Use emojis in your response.",
        },
    ];
    const result = completion({ modelId, history, stream: true });
    console.log("\n🤖 Model response:");
    for await (const token of result.tokenStream) {
        process.stdout.write(token);
    }
    const stats = await result.stats;
    console.log("\n\n📊 Performance Stats:", stats);
    await unloadModel({ modelId, clearStorage: false });
    process.exit(0);
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}
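The onProgress callback in the script repeats the byte-to-megabyte conversion several times. As an optional refactoring sketch, the formatting can be moved into a pure function that is easier to test. formatProgress() and toMB() are hypothetical helpers, not SDK functions; the progress and shardInfo fields are the ones used in the script above, and the sample values are made up.

```javascript
// Convert a byte count to a megabyte string with two decimals.
const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(2);

// Hypothetical formatter for the progress object passed to onProgress.
function formatProgress(progress) {
  const fileLine =
    `${progress.percentage.toFixed(1)}% ` +
    `(${toMB(progress.downloaded)}MB / ${toMB(progress.total)}MB)`;

  const { shardInfo } = progress;
  // Without shardInfo (archive-based sources) only a single progress line is available.
  if (!shardInfo) return `Progress: ${fileLine}`;

  // With shardInfo, report the current shard and the overall download.
  return [
    `Downloading ${shardInfo.shardName} (${shardInfo.currentShard}/${shardInfo.totalShards})`,
    `   File: ${fileLine}`,
    `   Overall: ${shardInfo.overallPercentage.toFixed(1)}% ` +
      `(${toMB(shardInfo.overallDownloaded)}MB / ${toMB(shardInfo.overallTotal)}MB)`,
  ].join("\n");
}

// Made-up sample values for illustration.
console.log(formatProgress({ percentage: 50, downloaded: 1048576, total: 2097152 }));
```

With such a helper, the callback reduces to onProgress: (progress) => console.log(formatProgress(progress)).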

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.