feat(cli_demo): add native multi-GPU support via device_map#815
Draft
IMaloney wants to merge 4 commits into zai-org:main from
Conversation
added 4 commits
February 19, 2026 03:15
Add --device_map CLI argument to enable multi-GPU inference without
code modifications. This implements the functionality described in
existing comments at lines 92-93 and 146-150.
Changes:
- Add device_map parameter to generate_video() with options:
  - None (default): uses sequential CPU offload (backward compatible)
  - 'balanced': distributes the model evenly across GPUs (recommended)
  - 'auto': automatic device placement by accelerate
  - 'sequential': fills GPUs one by one in order
- Conditionally enable CPU offload only when device_map is None
- Pass device_map to from_pretrained() for all three pipeline types:
CogVideoXPipeline, CogVideoXImageToVideoPipeline, CogVideoXVideoToVideoPipeline
- Update module docstring with multi-GPU usage examples
- Add validation for device_map values
Backward compatibility:
- Default behavior unchanged (device_map=None uses CPU offload)
- All existing scripts work without modification
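The validation and conditional-offload behavior described above can be sketched as a small helper; the function name `resolve_device_map` is hypothetical and only illustrates the logic this PR describes for generate_video(), it is not the actual code in cli_demo.py.

```python
# Hypothetical helper mirroring the device_map handling described in this PR.
VALID_DEVICE_MAPS = {"auto", "balanced", "sequential"}

def resolve_device_map(device_map):
    """Validate the --device_map value and decide whether to enable CPU offload.

    Returns (device_map_for_from_pretrained, enable_cpu_offload).
    """
    if device_map is None:
        # Default path: single GPU with sequential CPU offload (backward compatible)
        return None, True
    if device_map not in VALID_DEVICE_MAPS:
        raise ValueError(
            f"Invalid device_map {device_map!r}; "
            f"expected one of {sorted(VALID_DEVICE_MAPS)} or None"
        )
    # Multi-GPU path: accelerate places the modules, so CPU offload stays off
    return device_map, False
```

The key design point is that CPU offload and accelerate's device placement are mutually exclusive: enabling both would move modules that accelerate has already pinned to specific GPUs.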
Usage:
# Multi-GPU with balanced distribution (recommended):
python cli_demo.py --prompt '...' --model_path THUDM/CogVideoX1.5-5b \
--device_map balanced
# Multi-GPU with auto placement:
python cli_demo.py --prompt '...' --model_path THUDM/CogVideoX1.5-5b \
--device_map auto
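The CLI wiring behind the usage examples above might look like the following; this is a minimal sketch assuming argparse (as cli_demo.py already uses), with the `--device_map` choices matching the PR and the other flags shown only for context.

```python
import argparse

def build_parser():
    # Sketch of the argument parser; flag names follow the PR description.
    parser = argparse.ArgumentParser(
        description="CogVideoX demo with optional multi-GPU device placement"
    )
    parser.add_argument("--prompt", type=str, required=True, help="Text prompt")
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX1.5-5b")
    parser.add_argument(
        "--device_map",
        type=str,
        default=None,
        choices=["auto", "balanced", "sequential"],
        help="Multi-GPU placement strategy; omit to use sequential CPU offload",
    )
    return parser

args = build_parser().parse_args(
    ["--prompt", "a cat", "--device_map", "balanced"]
)
```

Leaving the default at None preserves the existing single-GPU behavior, so scripts that never pass `--device_map` are unaffected.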
Add comprehensive tests for device_map argument functionality
- Test all device_map options: None, auto, balanced, sequential
- Test all pipeline types: t2v, i2v, v2v
- Verify CPU offload logic (enabled when device_map=None, disabled otherwise)
- Verify from_pretrained receives the correct device_map parameter
- Test that VAE optimizations are always enabled regardless of device_map
- Test that invalid device_map values raise ValueError
- Verify backward compatibility (default behavior unchanged)
- Mock heavy dependencies (torch, diffusers) for fast CI execution

47 tests covering all multi-GPU device_map scenarios
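One of the mocked tests described above could be sketched like this; `unittest.mock` stands in for a diffusers pipeline class so no GPU or model download is needed. The `load_pipeline` function is a stand-in for the loading code under test, not the actual cli_demo.py implementation.

```python
from unittest import mock

def load_pipeline(pipeline_cls, model_path, device_map=None):
    """Stand-in for the code under test: passes device_map through to
    from_pretrained and enables CPU offload only when device_map is None."""
    pipe = pipeline_cls.from_pretrained(model_path, device_map=device_map)
    if device_map is None:
        pipe.enable_sequential_cpu_offload()
    return pipe

def test_balanced_device_map_disables_offload():
    fake_cls = mock.Mock()
    pipe = load_pipeline(fake_cls, "THUDM/CogVideoX1.5-5b", device_map="balanced")
    # from_pretrained must receive the device_map the user asked for
    fake_cls.from_pretrained.assert_called_once_with(
        "THUDM/CogVideoX1.5-5b", device_map="balanced"
    )
    # CPU offload must not be enabled on the multi-GPU path
    pipe.enable_sequential_cpu_offload.assert_not_called()

test_balanced_device_map_disables_offload()
```

Mocking at the class boundary keeps each test fast enough for CI while still verifying the exact arguments the real pipeline would receive.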
Author
Hi @fengxiaolonger, I will have some time later tonight or this weekend to test. Once done, I'll include results!
Author
Hi @fengxiaolonger, I've tested this on 2x identical RTX 4090s with all 47 unit tests passing. I attempted to find mixed-GPU cloud instances (different VRAM sizes) to test the edge case you mentioned, but they're very rare; most providers seem to use identical hardware in their offerings. I am going to buy hardware that will let me test this, but gathering the parts will take more time than I thought -- probably a month. Is that ok?
Adds --device_map argument (auto/balanced/sequential) to distribute model layers across GPUs.
Removes the need for the manual code edits mentioned in existing comments; backward compatible.
Tested with a multi-GPU setup; 47 tests pass, pre-commit clean.