
lloyal.node


Covalent Inference for Node.js

Composable inference primitives for forkable decode state, shared-prefix KV branching, and continuous tree batching. Branches share a KV prefix while keeping independent machinery (sampler chain, grammar, logits snapshot, perplexity tracker) for controlled divergence at decode time. BranchStore packs tokens from N branches into a single llama_batch and dispatches once; kv::tenancy manages seq_id leases automatically, acquiring them on create()/fork(), evicting on prune(), and rebuilding on retainOnly().

Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.

The Branch API

import { createContext, Branch, BranchStore } from "@lloyal-labs/lloyal.node";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 6 });
const store = new BranchStore(ctx);

// Shared prompt: "Explain quantum entanglement"
const prompt = await ctx.tokenize("Explain quantum entanglement");

const root = Branch.create(ctx, 0, { temperature: 0.8 });
await root.prefill(prompt);

// Fork 4 branches — each gets a different reasoning prefix
const analogy  = await root.fork();
const formal   = await root.fork();
const socratic = await root.fork();
const visual   = await root.fork();

// Scatter-prefill: inject divergent prefixes in one batched dispatch
// 4 branches × variable lengths → auto bin-packed into minimal GPU calls
await store.prefill([
  [analogy,  await ctx.tokenize("Think of it like two coins...")],    // 12 tokens
  [formal,   await ctx.tokenize("In quantum mechanics, the...")],     // 8 tokens
  [socratic, await ctx.tokenize("What happens when you measure...")], // 10 tokens
  [visual,   await ctx.tokenize("Imagine two particles...")],         // 7 tokens
]);

// Generate — all 4 in lockstep, 1 GPU call per step
const branches = [analogy, formal, socratic, visual];
const finished = new Set();
for (;;) {
  const live = branches.filter(b => !finished.has(b));
  if (!live.length) break;
  const produced = live.map(b => ({ b, ...b.produce() }));

  // Branches that hit a stop token are done; keep them alive for scoring
  for (const p of produced.filter(p => p.isStop)) finished.add(p.b);

  // Commit survivors — accept + decode in one GPU dispatch
  const items = produced
    .filter(p => !p.isStop)
    .map(p => { p.b.accept(p.token); return [p.b, p.token]; });
  if (items.length) await store.commit(items);
}

// Winner takes all — one seq_keep pass, losers vaporized
const winner = branches.reduce((a, b) => (a.perplexity < b.perplexity ? a : b));
await store.retainOnly(winner);
// store.available === nSeqMax - 1 (only the winner still holds a lease)

Or, for single-branch generation, a Branch is an async iterable that generates until EOG:

for await (const { token, text } of branch) {
  process.stdout.write(text);
}

Continuous Tree Batching

Tree search with N branches means N calls to llama_decode() — each paying GPU dispatch overhead, memory barriers, and PCIe round-trips. BranchStore eliminates this: tokens from N branches — each at a different position, different seq_id, each needing independent logits captured — are packed into a single llama_batch and dispatched once. N branches, 1 GPU call.

Two packing strategies for different access patterns:

// commit: 1 token per branch — one GPU dispatch for N branches
await store.commit([[branch1, tok1], [branch2, tok2], [branch3, tok3]]);

// prefill: variable tokens per branch — asymmetric injection
await store.prefill([
  [branchA, systemTokens],  // 200 tokens
  [branchB, queryTokens],   //  12 tokens
  [branchC, docTokens],     // 800 tokens
]);
// Greedy bin-packed into ceil(total / nBatch) dispatches

KV Tenancy

Two resources, two scales. Slots (65K) are how many branches can exist — cheap CPU state. Leases (nSeqMax) are how many can decode — scarce KV cache residency. Tenancy manages the scarce resource automatically: leases are acquired on create()/fork(), evicted on prune(), rebuilt on retainOnly(). No manual seq_id tracking, ever.

store.available;                   // leases remaining — use for width/depth budget
await store.retainOnly(winner);    // nuclear: 1 seq_keep, rebuild vacancy

The turn lifecycle: search is surgical (N × prune()), promotion is nuclear (1 × retainOnly()). Per turn, fork → expand → evaluate → prune losers → repeat. Between turns, promote winner → tree is gone → next turn starts fresh.
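
A sketch of that loop, reusing the root branch from the example above; the mid-turn rollout and scoring are elided, and WIDTH is an illustrative budget, not an API value:

let frontier = root;

// Per turn: fork → expand → evaluate → prune losers
const WIDTH = Math.min(store.available, 4);   // width budget from remaining leases
const children = [];
for (let i = 0; i < WIDTH; i++) children.push(await frontier.fork());

// ...decode a few tokens per child via store.commit(), accumulating perplexity...

children.sort((a, b) => a.perplexity - b.perplexity);
for (const loser of children.slice(1)) await loser.prune();   // surgical: N × prune()
frontier = children[0];

// Between turns: promote the winner; the rest of the tree is gone
await store.retainOnly(frontier);             // nuclear: 1 seq_keep, vacancy rebuilt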

Topology

Parent/child edges are always-on. Simple chat → best-of-N → deep search is one continuum.

branch.parent;       // handle or null if root
branch.children;     // child handles
branch.isLeaf;       // no children?
branch.isActive;     // holds a KV lease?
| Method | FK analogy | Behavior |
| --- | --- | --- |
| prune() | RESTRICT | Throws if children exist |
| pruneSubtree() | CASCADE | Iterative post-order traversal |
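
For example, discarding a losing node that may or may not still have descendants (loser and winner are placeholder handles from an earlier search; pruneSubtree() is assumed to be awaitable like prune()):

// Drop one losing node: RESTRICT when it is a leaf, CASCADE when it still has children
if (loser.isLeaf) {
  await loser.prune();
} else {
  await loser.pruneSubtree();
}

// Walk a survivor's ancestry back to the root along parent edges
for (let node = winner; node.parent; node = node.parent) {
  // node.children and node.isActive are readable at every level
}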

Install

npm install @lloyal-labs/lloyal.node

Prebuilt binaries for 13 platform/GPU combinations. GPU selection at runtime, not install time.

| Platform | Arch | Acceleration |
| --- | --- | --- |
| macOS | arm64 | Metal |
| macOS | x64 | CPU |
| Linux | x64 | CPU / CUDA / Vulkan |
| Linux | arm64 | CPU / CUDA / Vulkan |
| Windows | x64 | CPU / CUDA / Vulkan |
| Windows | arm64 | CPU / Vulkan |

CI integration testing (real inference):

| Architecture | Test Model | Template |
| --- | --- | --- |
| Llama | Llama 3.2 1B | llama3 |
| Phi | Phi 3.5 Mini | phi3 |
| Qwen | Qwen 3 1.7B | chatml |
| Gemma | Gemma 3 1B | gemma |
| SmolLM | SmolLM2 1.7B | chatml |
| Ministral | Ministral 3B | mistral |

See distribution.md for details.


Examples

| Example | Pattern |
| --- | --- |
| best-of-n/ | Branch API: fork, produce/commit, perplexity selection |
| speculative/ | Branch API: draft/verify, fork/prune, bonus token sampling |
| streaming/ | Infinite context via BlinkKV reseeding with sidecar summarization |
| entropy/ | modelEntropy() mid-generation as control signal |
| grammar/ | Pull loop with generators, JSON schema constraints, KV + grammar branching |
| chat/ | Interactive streaming chat |
| embed/ | Text embeddings extraction |
node examples/best-of-n/best-of-n.mjs
node examples/speculative/speculative.mjs

Each example has a README explaining the pattern.


Other Patterns

Entropy as Control Signal

Model uncertainty mid-generation enables dynamic behavior:

const entropy = ctx.modelEntropy("bits");

if (entropy > 4.0) {
  // High uncertainty — model is guessing
  // Trigger retrieval, reduce temperature, or branch
}

See examples/entropy/ for entropy-triggered sampling strategies.

Low-Level KV Operations

For fine-grained control without Branch:

| Approach | Method | Use Case |
| --- | --- | --- |
| Sequence copy | kvSeqCopy(src, dst) | Share prefix across sequences |
| Snapshot/restore | kvCacheSave() / kvCacheLoad() | Sequential exploration, return to checkpoint |
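
A sketch of both, assuming these hang off the context object like tokenize() and modelEntropy() do, and using the zero-argument snapshot forms from the table; the real signatures may take a path or snapshot handle, so check the API reference:

// Share a prefilled prompt: copy sequence 0's KV into sequence 1, then decode both
ctx.kvSeqCopy(0, 1);

// Sequential exploration: checkpoint, explore, roll back
await ctx.kvCacheSave();    // snapshot the current KV state
// ...decode an exploratory continuation and evaluate it...
await ctx.kvCacheLoad();    // return to the checkpoint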

Grammar-Constrained Generation

const grammar = await ctx.jsonSchemaToGrammar(schema);
const branch = Branch.create(ctx, 0, params, undefined, grammar);
await branch.prefill(promptTokens);
// Grammar state cloned automatically on fork()
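
Because the grammar state travels with fork(), a child branch keeps emitting schema-valid tokens without re-attaching anything; a minimal sketch:

// Sibling continuation under the same JSON-schema constraint
const alt = await branch.fork();
// Both branches now decode independently (e.g. via store.commit(), as above)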

See examples/grammar/ for the full branch fork pattern.


API Reference

Full API documentation: lloyal-ai.github.io/lloyal.node

Generated from lib/index.d.ts with TypeDoc.


Ecosystem

| Package | Runtime | Description |
| --- | --- | --- |
| liblloyal | C++ | Header-only inference kernel |
| lloyal.node | Node.js | This package |
| nitro-llama | React Native | Mobile bindings via Nitro Modules |
| tsampler | TypeScript | Reference sampler implementation |

Contributing

See CONTRIBUTING.md for development setup and release process.

License

Apache 2.0 — See LICENSE for details.