Replies: 1 comment
Note that 2179989504 bytes is 2079 MB. Maybe this error was triggered because the VRAM filled up, not because of a single allocation. Inference also needs a working buffer: if you have a ~14 GB model and ~15 GB of available VRAM, the ~1 GB left may not be enough for it. I suggest trying
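The byte-to-MiB conversion and the headroom estimate above can be checked with a quick sketch (the sizes are the approximate figures from this thread, not measured values):

```python
# Convert the failed allocation size from the error message to MiB.
failed_alloc_bytes = 2179989504
mib = failed_alloc_bytes / (1024 * 1024)
print(mib)  # 2079.0

# Rough headroom check: a ~14 GB model loaded into ~15 GB of usable
# VRAM leaves only about 1 GB for the inference working buffer.
usable_vram_gb = 15
model_gb = 14
print(usable_vram_gb - model_gb)  # 1
```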
That message doesn't really reflect
Not on sd.cpp mainline, but there is a PR that implements it: #1184
I'm trying to run Qwen Image on a 9070 XT (16 GB VRAM total, 15 actually available).

diffusion model: qwen-image-Q5_K_M.gguf (14 GB)
llm: Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf (6 GB)

This setup requires 20 GB total, and it seems that the Vulkan impl (release b314d80) is trying to allocate the whole 20 GB in GPU VRAM, so it fails with ErrorOutOfDeviceMemory. I thought that --offload-to-cpu would somehow help with this (I have 64 GB RAM, around 40 available). The suspicious part is "RAM 0.00 MB": it's like there is no hint that RAM can be used.

Another question I have: would it be possible to use a second GPU for the LLM part? I have a dual setup with an older 1070 with 8 GB VRAM, and Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf can easily fit there with somewhat decent speed (the llama.cpp Vulkan impl outputs around 35 t/s). Would it be hard to implement?
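For what it's worth, the VRAM budget described in the question can be sketched like this (the sizes are the approximate figures quoted above, not exact allocator numbers):

```python
# Approximate VRAM budget from the question (all sizes in GB).
diffusion_model_gb = 14  # qwen-image-Q5_K_M.gguf
text_encoder_gb = 6      # Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf
usable_vram_gb = 15      # 16 GB 9070 XT, ~15 GB actually available

total_gb = diffusion_model_gb + text_encoder_gb
print(total_gb)                   # 20
print(total_gb > usable_vram_gb)  # True: both models cannot be resident at once
```

This is why the thread turns to --offload-to-cpu and to splitting the LLM onto a second GPU: either option would move the 6 GB text encoder out of the 15 GB budget.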