feat(cli_demo): add native multi-GPU support via device_map#815

Draft
IMaloney wants to merge 4 commits intozai-org:mainfrom
IMaloney:feat/native-multi-gpu-support

Conversation

IMaloney commented Feb 19, 2026

Adds a --device_map argument (auto/balanced/sequential) to distribute model layers across GPUs.

Removes the need for the manual code edits mentioned in the existing comments; fully backward compatible.

Tested with a multi-GPU setup; all 47 tests pass and pre-commit is clean.

Test User added 4 commits February 19, 2026 03:15
Add --device_map CLI argument to enable multi-GPU inference without
code modifications. This implements the functionality described in
existing comments at lines 92-93 and 146-150.

Changes:
- Add device_map parameter to generate_video() with options:
  - None (default): Uses sequential CPU offload (backward compatible)
  - 'balanced': Distributes model evenly across GPUs (recommended)
  - 'auto': Automatic device placement by accelerate
  - 'sequential': Fills GPUs one by one in order
- Conditionally enable CPU offload only when device_map is None
- Pass device_map to from_pretrained() for all three pipeline types:
  CogVideoXPipeline, CogVideoXImageToVideoPipeline, CogVideoXVideoToVideoPipeline
- Update module docstring with multi-GPU usage examples
- Add validation for device_map values

Backward compatibility:
- Default behavior unchanged (device_map=None uses CPU offload)
- All existing scripts work without modification

Usage:
  # Multi-GPU with balanced distribution (recommended):
  python cli_demo.py --prompt '...' --model_path THUDM/CogVideoX1.5-5b \
    --device_map balanced

  # Multi-GPU with auto placement:
  python cli_demo.py --prompt '...' --model_path THUDM/CogVideoX1.5-5b \
    --device_map auto
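The validation and conditional-offload behavior listed in the changes above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name `resolve_pipeline_devices` is hypothetical, but the logic (offload only when `device_map` is `None`, `ValueError` on unknown values) matches the commit description.

```python
# Hypothetical sketch of the device_map handling described in this commit.
# resolve_pipeline_devices is an illustrative name, not code from the PR.

VALID_DEVICE_MAPS = {None, "auto", "balanced", "sequential"}

def resolve_pipeline_devices(device_map):
    """Validate device_map and decide whether CPU offload applies.

    Returns (device_map, use_cpu_offload). Offload is enabled only when
    device_map is None, preserving the backward-compatible default.
    """
    if device_map not in VALID_DEVICE_MAPS:
        raise ValueError(
            f"invalid device_map {device_map!r}; "
            "expected None, 'auto', 'balanced', or 'sequential'"
        )
    return device_map, device_map is None
```

The resolved `device_map` would then be forwarded to `from_pretrained()` for whichever pipeline class is selected, and `enable_sequential_cpu_offload()` called only when the offload flag is set.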

- Add comprehensive tests for device_map argument functionality
- Test all device_map options: None, auto, balanced, sequential
- Test all pipeline types: t2v, i2v, v2v
- Verify CPU offload logic (enabled when device_map=None, disabled otherwise)
- Verify from_pretrained receives correct device_map parameter
- Test VAE optimizations are always enabled regardless of device_map
- Test invalid device_map values raise ValueError
- Verify backward compatibility (default behavior unchanged)
- Mock heavy dependencies (torch, diffusers) for fast CI execution

47 tests covering all multi-GPU device_map scenarios
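The mocked-dependency approach described above can be illustrated with a small harness. This is a sketch of the test shape only: `load_pipeline` and its injected `loader` are hypothetical stand-ins (the real tests patch torch/diffusers), but the two assertions mirror what the commit message says is verified — `from_pretrained` receives the right `device_map`, and CPU offload runs only for the default.

```python
# Illustrative test harness; load_pipeline and loader are hypothetical
# stand-ins so the test never imports torch or diffusers.
from unittest import mock

def load_pipeline(model_path, device_map=None, loader=None):
    # Stand-in for the call into CogVideoXPipeline.from_pretrained.
    pipe = loader(model_path, device_map=device_map)
    if device_map is None:
        # Backward-compatible default: sequential CPU offload.
        pipe.enable_sequential_cpu_offload()
    return pipe

def test_device_map_forwarded():
    loader = mock.Mock()
    load_pipeline("THUDM/CogVideoX1.5-5b", device_map="balanced", loader=loader)
    loader.assert_called_once_with("THUDM/CogVideoX1.5-5b", device_map="balanced")
    # Offload must be skipped when a device_map is given.
    loader.return_value.enable_sequential_cpu_offload.assert_not_called()

def test_default_uses_cpu_offload():
    loader = mock.Mock()
    load_pipeline("THUDM/CogVideoX1.5-5b", loader=loader)
    loader.return_value.enable_sequential_cpu_offload.assert_called_once()
```

Because the loader is injected, the suite runs in CI without GPUs or heavyweight imports, which is what keeps the 47 tests fast.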
@IMaloney IMaloney marked this pull request as draft February 20, 2026 06:29
IMaloney (Author) commented:

Hi @fengxiaolonger, I will have some time later tonight or this weekend to test. Once done, I'll include the results!

IMaloney (Author) commented Feb 27, 2026

Hi @fengxiaolonger, I've tested this on 2x identical RTX 4090s with all 47 unit tests passing. I attempted to find mixed-GPU cloud instances (different VRAM sizes) to test the edge case you mentioned, but they seem to be very rare: most providers offer identical hardware. I am going to buy hardware that will let me test this myself, but gathering the parts will take more time than I expected, probably a month. Is that OK?
