Dynamic memory allocation as in Ollama #544

ittaboba · 2026-01-19T17:40:05Z

ittaboba
Jan 19, 2026

Hi there,

I noticed Ollama allocates memory just for the time of text generation. Memory pressure before and after drops. Not sure about how they achieve it, possibly related to something like this

It doesn't seem to be the case for LM Studio: as long as the model is loaded, memory pressure stays constant.

My app uses node-llama-cpp and currently behaves like LM Studio. Is there any setup to achieve the Ollama behavior?

I thought about loading the model at the beginning of each generation and calling dispose() at the end of it as explained here. But Ollama seems to be much more reactive, especially on bigger models like gpt-oss it doesn't seem they completely offload and reload on each generation.

In any case, thanks for this great library!

giladgd · 2026-01-21T01:02:15Z

giladgd
Jan 21, 2026
Maintainer

@ittaboba The behavior you noticed is most likely the saving and restoring of the KV cache (the evaluation state), which saves you time on evaluating the past chat history, so only the new prompt is evaluated.
Because of mmap, unloading and reloading a model is pretty fast when done in a short sequence, so your approach of disposing objects and restoring them later is correct, you were just missing the evaluation state restoration optimization.

I do plan to make all of this automatically happen behind the scenes by default in the upcoming node-llama-cpp v4 beta, but it won't be on the first release.

I've made a code snippet you can run that demonstrates this, run it with npx vite-node ./snippet.ts:

import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";
import crypto from "node:crypto";
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const stateFile = path.join(os.tmpdir(), crypto.randomUUID() + ".bin");
await using deleteStateFile = {
    [Symbol.asyncDispose]() {
        console.log("Deleting state file: " + stateFile);
        return fs.rm(stateFile, {force: true});
    }
};
console.log("state file: " + stateFile);


const llama = await getLlama();
const modelPath = await resolveModelFile(modelUri);

const startTime = Date.now();
const model = await llama.loadModel({modelPath});
const context = await model.createContext({
    // this is the default config in Ollama for this model IIRC
    contextSize: {max: 8192}
});
const endTime = Date.now();
console.log("Loaded in", (endTime - startTime) + "ms");

const sequence = context.getSequence();
const session = new LlamaChatSession({contextSequence: sequence});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const promptStartTime = Date.now();
const a1 = await session.prompt(q1);
const promptEndTime = Date.now();

console.log("AI: " + a1);
console.log("Evaluation time", (promptEndTime - promptStartTime) + "ms");
console.log("Evaluated input tokens", sequence.tokenMeter.usedInputTokens);

await sequence.saveStateToFile(stateFile);
const chatHistory = session.getChatHistory();
session.dispose();
context.dispose();
model.dispose();

// wait 10 seconds, to see the memory pressure go down
console.log("Waiting 10 seconds before restoring state...");
await new Promise((accept) => setTimeout(accept, 1000 * 10));

// then later restore from where we left off and evaluate another prompt
// on top of the previous chat history
console.log("Restoring state from file: " + stateFile);

const startTime2 = Date.now();
const model2 = await llama.loadModel({modelPath});
const context2 = await model2.createContext({
    contextSize: {max: 8192}
});
const sequence2 = context2.getSequence();
const session2 = new LlamaChatSession({contextSequence: sequence2});
session2.setChatHistory(chatHistory);

// try commenting out this line and see how the "Evaluated input tokens"
// changes to evaluate the entire history again, and the evaluation time increases
await sequence2.loadStateFromFile(stateFile, {acceptRisk: true});
const endTime2 = Date.now();
console.log("Restored in", (endTime2 - startTime2) + "ms");

const q2 = "What did I just ask you?";
console.log("User: " + q2);

const promptStartTime2 = Date.now();
const a2 = await session2.prompt(q2);
const promptEndTime2 = Date.now();

console.log("AI: " + a2);
console.log("Evaluation time", (promptEndTime2 - promptStartTime2) + "ms");
console.log("Evaluated input tokens", sequence2.tokenMeter.usedInputTokens);

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Dynamic memory allocation as in Ollama #544

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Dynamic memory allocation as in Ollama #544

Uh oh!

ittaboba Jan 19, 2026

Replies: 1 comment

Uh oh!

giladgd Jan 21, 2026 Maintainer

ittaboba
Jan 19, 2026

giladgd
Jan 21, 2026
Maintainer