Covalent Inference for Node.js
Composable inference primitives for forkable decode state, shared-prefix KV branching, and continuous tree batching. Branches share a KV prefix while keeping independent machinery — sampler chain, grammar, logits snapshot, perplexity tracker — for controlled divergence at decode time. BranchStore packs tokens from N branches (each at a different position, different seq_id, each needing independent logits captured) into a single llama_batch and dispatches once. kv::tenancy manages seq_id leases automatically — acquired on create()/fork(), evicted on prune(), rebuilt on retainOnly().
Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.
import { createContext, Branch, BranchStore } from "@lloyal-labs/lloyal.node";
const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 6 });
const store = new BranchStore(ctx);
// Shared prompt: "Explain quantum entanglement"
const prompt = await ctx.tokenize("Explain quantum entanglement");
const root = Branch.create(ctx, 0, { temperature: 0.8 });
await root.prefill(prompt);
// Fork 4 branches — each gets a different reasoning prefix
const analogy = await root.fork();
const formal = await root.fork();
const socratic = await root.fork();
const visual = await root.fork();
// Scatter-prefill: inject divergent prefixes in one batched dispatch
// 4 branches × variable lengths → auto bin-packed into minimal GPU calls
await store.prefill([
[analogy, await ctx.tokenize("Think of it like two coins...")], // 12 tokens
[formal, await ctx.tokenize("In quantum mechanics, the...")], // 8 tokens
[socratic, await ctx.tokenize("What happens when you measure...")], // 10 tokens
[visual, await ctx.tokenize("Imagine two particles...")], // 7 tokens
]);
// Generate — all 4 in lockstep, 1 GPU call per step
const branches = [analogy, formal, socratic, visual];
for (;;) {
const live = branches.filter(b => !b.disposed);
if (!live.length) break;
const produced = live.map(b => ({ b, ...b.produce() }));
// Prune branches that hit stop tokens
for (const p of produced.filter(p => p.isStop)) await p.b.prune();
// Commit survivors — accept + decode in one GPU dispatch
const items = produced
.filter(p => !p.isStop)
.map(p => { p.b.accept(p.token); return [p.b, p.token]; });
await store.commit(items);
}
// Winner takes all — one seq_keep pass, losers vaporized
const winner = branches
.filter(b => !b.disposed)
.reduce((a, b) => (a.perplexity < b.perplexity ? a : b));
await store.retainOnly(winner);
// store.available === nSeqMax - 1 — all leases recovered

Or for single-branch generation, Branch is an async iterable — generate until EOG:
for await (const { token, text } of branch) {
process.stdout.write(text);
}

Tree search with N branches means N calls to llama_decode() — each paying GPU dispatch overhead, memory barriers, and PCIe round-trips. BranchStore eliminates this: tokens from N branches — each at a different position, different seq_id, each needing independent logits captured — are packed into a single llama_batch and dispatched once. N branches, 1 GPU call.
Two packing strategies for different access patterns:
// commit: 1 token per branch — one GPU dispatch for N branches
await store.commit([[branch1, tok1], [branch2, tok2], [branch3, tok3]]);
// prefill: variable tokens per branch — asymmetric injection
await store.prefill([
[branchA, systemTokens], // 200 tokens
[branchB, queryTokens], // 12 tokens
[branchC, docTokens], // 800 tokens
]);
// Greedy bin-packed into ceil(total / nBatch) dispatches

Two resources, two scales. Slots (65K) are how many branches can exist — cheap CPU state. Leases (nSeqMax) are how many can decode — scarce KV cache residency. Tenancy manages the scarce resource automatically: leases are acquired on create()/fork(), evicted on prune(), rebuilt on retainOnly(). No manual seq_id tracking, ever.
store.available; // leases remaining — use for width/depth budget
await store.retainOnly(winner); // nuclear: 1 seq_keep, rebuild vacancy

The turn lifecycle: search is surgical (N × prune()), promotion is nuclear (1 × retainOnly()). Per turn, fork → expand → evaluate → prune losers → repeat. Between turns, promote winner → tree is gone → next turn starts fresh.
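As a concrete shape for that loop, here is a minimal sketch of one turn built only from the calls shown above; `searchTurn`, the `prefixes` argument (pre-tokenized arrays), `maxSteps`, and the perplexity-based promotion are illustrative choices, not part of the API:

```js
// Hypothetical helper: one turn of fork → expand → evaluate → prune → promote.
async function searchTurn(store, root, prefixes, maxSteps) {
  // Width budget comes from leases (store.available), not from slots
  const width = Math.min(prefixes.length, store.available);
  const branches = [];
  for (let i = 0; i < width; i++) branches.push(await root.fork());

  // Inject the divergent prefixes in one batched dispatch
  await store.prefill(branches.map((b, i) => [b, prefixes[i]]));

  // Expand in lockstep, pruning branches as they hit stop tokens
  for (let step = 0; step < maxSteps; step++) {
    const live = branches.filter(b => !b.disposed);
    if (!live.length) break;
    const produced = live.map(b => ({ b, ...b.produce() }));
    for (const p of produced.filter(p => p.isStop)) await p.b.prune();
    const survivors = produced.filter(p => !p.isStop);
    if (!survivors.length) break;
    await store.commit(survivors.map(p => { p.b.accept(p.token); return [p.b, p.token]; }));
  }

  // Promote: one retainOnly() pass returns every other lease to the pool
  const finalists = branches.filter(b => !b.disposed);
  if (!finalists.length) return null;
  const winner = finalists.reduce((a, b) => (a.perplexity < b.perplexity ? a : b));
  await store.retainOnly(winner);
  return winner;
}
```

Between turns the promoted winner becomes the next turn's root, so each turn starts from a single-sequence tree.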
Parent/child edges are always-on. Simple chat → best-of-N → deep search is one continuum.
branch.parent; // handle or null if root
branch.children; // child handles
branch.isLeaf; // no children?
branch.isActive; // holds a KV lease?

| Method | FK analogy | Behavior |
|---|---|---|
| prune() | RESTRICT | Throws if children exist |
| pruneSubtree() | CASCADE | Iterative post-order traversal |
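To make the two behaviors concrete, a short sketch against a tree like the one in the quickstart; `losing` and `collectLeaves` are illustrative names, and pruneSubtree() is assumed here to return a promise the same way prune() does:

```js
// RESTRICT: prune() refuses to orphan children
try {
  await root.prune();            // root still has live children, so this throws
} catch (err) {
  // prune the leaves first, or use pruneSubtree()
}

// CASCADE: drop a whole losing subtree in one call (children first, then the node)
await losing.pruneSubtree();

// The always-on edges make tree walks trivial, e.g. collecting leaves that still decode
function collectLeaves(node, out = []) {
  if (node.isLeaf && node.isActive) out.push(node);
  for (const child of node.children) collectLeaves(child, out);
  return out;
}
```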
npm install @lloyal-labs/lloyal.node

Prebuilt binaries for 13 platform/GPU combinations. GPU selection at runtime, not install time.
| Platform | Arch | Acceleration |
|---|---|---|
| macOS | arm64 | Metal |
| macOS | x64 | CPU |
| Linux | x64 | CPU / CUDA / Vulkan |
| Linux | arm64 | CPU / CUDA / Vulkan |
| Windows | x64 | CPU / CUDA / Vulkan |
| Windows | arm64 | CPU / Vulkan |
CI integration testing (real inference):
| Architecture | Test Model | Template |
|---|---|---|
| Llama | Llama 3.2 1B | llama3 |
| Phi | Phi 3.5 Mini | phi3 |
| Qwen | Qwen 3 1.7B | chatml |
| Gemma | Gemma 3 1B | gemma |
| SmolLM | SmolLM2 1.7B | chatml |
| Ministral | Ministral 3B | mistral |
See distribution.md for details.
| Example | Pattern |
|---|---|
| best-of-n/ | Branch API: fork, produce/commit, perplexity selection |
| speculative/ | Branch API: draft/verify, fork/prune, bonus token sampling |
| streaming/ | Infinite context via BlinkKV reseeding with sidecar summarization |
| entropy/ | modelEntropy() mid-generation as control signal |
| grammar/ | Pull loop with generators, JSON schema constraints, KV + grammar branching |
| chat/ | Interactive streaming chat |
| embed/ | Text embeddings extraction |
node examples/best-of-n/best-of-n.mjs
node examples/speculative/speculative.mjs

Each example has a README explaining the pattern.
Model uncertainty mid-generation enables dynamic behavior:
const entropy = ctx.modelEntropy("bits");
if (entropy > 4.0) {
// High uncertainty — model is guessing
// Trigger retrieval, reduce temperature, or branch
}

See examples/entropy/ for entropy-triggered sampling strategies.
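One way to act on the signal, sketched with the Branch and BranchStore calls from the quickstart and assuming a branch and store already set up as shown there; the 4.0-bit threshold, maxSteps, and the fork-then-pick-by-perplexity policy are illustrative, not a prescribed recipe:

```js
// Sketch: fork a second hypothesis when uncertainty spikes, then keep the better one.
const maxSteps = 256;                     // illustrative budget
const hypotheses = [branch];

for (let step = 0; step < maxSteps; step++) {
  const live = hypotheses.filter(b => !b.disposed);
  if (!live.length) break;

  // High entropy plus a spare lease: explore two continuations from the same prefix
  if (live.length === 1 && store.available > 0 && ctx.modelEntropy("bits") > 4.0) {
    hypotheses.push(await live[0].fork());
  }

  const produced = hypotheses.filter(b => !b.disposed).map(b => ({ b, ...b.produce() }));
  for (const p of produced.filter(p => p.isStop)) await p.b.prune();
  const survivors = produced.filter(p => !p.isStop);
  if (!survivors.length) break;
  await store.commit(survivors.map(p => { p.b.accept(p.token); return [p.b, p.token]; }));
}

// Converge back to one sequence, keeping the lower-perplexity continuation
const finalists = hypotheses.filter(b => !b.disposed);
if (finalists.length > 1) {
  await store.retainOnly(finalists.reduce((a, b) => (a.perplexity < b.perplexity ? a : b)));
}
```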
For fine-grained control without Branch:
| Approach | Method | Use Case |
|---|---|---|
| Sequence copy | kvSeqCopy(src, dst) | Share prefix across sequences |
| Snapshot/restore | kvCacheSave() / kvCacheLoad() | Sequential exploration, return to checkpoint |
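A brief sketch of how these fit together; the table only names the methods, so treat the details below (that they live on the context object, their synchronous call style, and where kvCacheSave()/kvCacheLoad() keep the snapshot) as assumptions:

```js
// Share a prefix: after decoding a shared prompt into sequence 0, copy its KV
// entries so sequence 1 can continue from the same point without re-prefilling.
ctx.kvSeqCopy(0, 1);

// Sequential exploration: checkpoint, explore, rewind.
ctx.kvCacheSave();      // snapshot the current KV cache
// ...decode an exploratory continuation on top of the checkpoint...
ctx.kvCacheLoad();      // restore the snapshot and try another path
```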
const grammar = await ctx.jsonSchemaToGrammar(schema);
const branch = Branch.create(ctx, 0, params, undefined, grammar);
await branch.prefill(promptTokens);
// Grammar state cloned automatically on fork()

See examples/grammar/ for the full branch fork pattern.
Full API documentation: lloyal-ai.github.io/lloyal.node
Generated from lib/index.d.ts with TypeDoc.
| Package | Runtime | Description |
|---|---|---|
| liblloyal | C++ | Header-only inference kernel |
| lloyal.node | Node.js | This package |
| nitro-llama | React Native | Mobile bindings via Nitro Modules |
| tsampler | TypeScript | Reference sampler implementation |
See CONTRIBUTING.md for development setup and release process.
Apache 2.0 — See LICENSE for details.