@ittaboba The behavior you noticed is most likely the saving and restoring of the KV cache (the evaluation state), which saves you time on evaluating the past chat history, so only the new prompt is evaluated. I do plan to make all of this happen automatically behind the scenes by default in an upcoming version.

I've made a code snippet you can run that demonstrates this:

```ts
import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";
import crypto from "node:crypto";
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";
const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";
const stateFile = path.join(os.tmpdir(), crypto.randomUUID() + ".bin");
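// automatically delete the temporary state file when this scope exits (via `await using`)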
await using deleteStateFile = {
    [Symbol.asyncDispose]() {
        console.log("Deleting state file: " + stateFile);
        return fs.rm(stateFile, {force: true});
    }
};
console.log("state file: " + stateFile);
const llama = await getLlama();
const modelPath = await resolveModelFile(modelUri);
const startTime = Date.now();
const model = await llama.loadModel({modelPath});
const context = await model.createContext({
    // this is the default config in Ollama for this model IIRC
    contextSize: {max: 8192}
});
const endTime = Date.now();
console.log("Loaded in", (endTime - startTime) + "ms");
const sequence = context.getSequence();
const session = new LlamaChatSession({contextSequence: sequence});
const q1 = "Hi there, how are you?";
console.log("User: " + q1);
const promptStartTime = Date.now();
const a1 = await session.prompt(q1);
const promptEndTime = Date.now();
console.log("AI: " + a1);
console.log("Evaluation time", (promptEndTime - promptStartTime) + "ms");
console.log("Evaluated input tokens", sequence.tokenMeter.usedInputTokens);
await sequence.saveStateToFile(stateFile);
const chatHistory = session.getChatHistory();
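// dispose the session, context and model to release the memory they hold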
session.dispose();
context.dispose();
model.dispose();
// wait 10 seconds, to see the memory pressure go down
console.log("Waiting 10 seconds before restoring state...");
await new Promise((accept) => setTimeout(accept, 1000 * 10));
// then later restore from where we left off and evaluate another prompt
// on top of the previous chat history
console.log("Restoring state from file: " + stateFile);
const startTime2 = Date.now();
const model2 = await llama.loadModel({modelPath});
const context2 = await model2.createContext({
    contextSize: {max: 8192}
});
const sequence2 = context2.getSequence();
const session2 = new LlamaChatSession({contextSequence: sequence2});
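// put the previous chat history back onto the new session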
session2.setChatHistory(chatHistory);
// try commenting out this line and see how "Evaluated input tokens" changes:
// without the restored state, the entire history is evaluated again and the evaluation time increases
await sequence2.loadStateFromFile(stateFile, {acceptRisk: true});
const endTime2 = Date.now();
console.log("Restored in", (endTime2 - startTime2) + "ms");
const q2 = "What did I just ask you?";
console.log("User: " + q2);
const promptStartTime2 = Date.now();
const a2 = await session2.prompt(q2);
const promptEndTime2 = Date.now();
console.log("AI: " + a2);
console.log("Evaluation time", (promptEndTime2 - promptStartTime2) + "ms");
console.log("Evaluated input tokens", sequence2.tokenMeter.usedInputTokens); |
Hi there,
I noticed that Ollama allocates memory only for the duration of text generation: memory pressure drops before and after. I'm not sure how they achieve it; it's possibly related to something like this.
It doesn't seem to be the case for LM Studio: as long as the model is loaded, memory pressure stays constant.
My app uses node-llama-cpp and currently behaves like LM Studio. Is there any setup to achieve the Ollama behavior?
I thought about loading the model at the beginning of each generation and calling `dispose()` at the end of it, as explained here. But Ollama seems to be much more reactive; especially with bigger models like gpt-oss, it doesn't seem like they completely offload and reload the model on each generation.

In any case, thanks for this great library!
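Roughly something like this is what I had in mind (just a sketch based on the API shown above; `answerOnce` is an illustrative helper of mine, not something from the library):

```ts
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

// Illustrative helper: load the model, answer a single prompt, then dispose
// everything so the memory is released again between generations.
async function answerOnce(modelUri: string, prompt: string) {
    const llama = await getLlama();
    const modelPath = await resolveModelFile(modelUri);

    const model = await llama.loadModel({modelPath});
    const context = await model.createContext({contextSize: {max: 8192}});
    const session = new LlamaChatSession({contextSequence: context.getSequence()});

    const answer = await session.prompt(prompt);

    // free everything so memory pressure drops until the next generation
    session.dispose();
    context.dispose();
    model.dispose();

    return answer;
}

const answer = await answerOnce(
    "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf",
    "Hi there, how are you?"
);
console.log(answer);
```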