Add inference and communication test support #27

Open
zzhfz wants to merge 2 commits into master from feat/unified-scripts
zzhfz (Contributor) commented Feb 26, 2026

Description

Unify script management so that inference and communication tests can be run through a single, integrated workflow.

Changes

  • Enhance install_deps.sh to support installing the -/inference/comm components
  • Update run_tests.sh to support component-based dependency checks
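
A minimal sketch of what component-based dependency checking could look like, assuming a dispatch on the component name. The function names `components_for` and `run_checks` are hypothetical and are not taken from the PR; the check lists only mirror what the logs in this PR show (vLLM and InfiniLM for inference, CUDA and NCCL tests for comm).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a component name to the checks it needs.
# None of these function names come from the PR itself.

components_for() {
    # Print the space-separated list of checks for one component.
    case "$1" in
        inference) echo "vllm infinilm" ;;
        comm)      echo "cuda nccl_tests" ;;
        all)       echo "cuda vllm infinilm nccl_tests" ;;
        *)         echo "unknown component: $1" >&2; return 1 ;;
    esac
}

run_checks() {
    # Resolve the component, then report each check that would run.
    local component="$1" checks name
    checks="$(components_for "$component")" || return 1
    for name in $checks; do
        echo "  ${name}... [would check]"
    done
}
```

Keeping the component-to-check mapping in one function means a new component only needs one extra `case` branch.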

Test Results

  • inference
(megatron) sunjinge@server:~/InfiniMetrics$ ./scripts/run_tests.sh --check inference test_inference.json
==========================================
InfiniMetrics Test Runner
==========================================

Checking inference dependencies:
  vLLM... [OK]
  InfiniLM... [OK] (version unknown)

Environment: INFINI_ROOT=/home/sunjinge/.infini
NCCL_ROOT=/home/sunjinge/.infini/nccl
CUDA_VISIBLE_DEVICES=not set


==========================================
Running: InfiniMetrics
Time: 20260226_024950
==========================================
Running 1 input path(s):
  - test_inference.json
  ... ... ... 
2026-02-26 02:51:17,875 - infinimetrics.utils.accelerator_monitor - INFO - nvidia monitoring stopped
2026-02-26 02:51:18,238 - infinimetrics.inference.frameworks.vllm_impl - INFO - vLLM model unloaded
2026-02-26 02:51:18,238 - infinimetrics.inference.direct - INFO - Model unloaded
2026-02-26 02:51:18,359 - infinimetrics.inference.inference_adapter - INFO - Inference adapter teardown complete
2026-02-26 02:51:18,361 - infinimetrics.executor - INFO - Executor: infer.vLLM.Direct completed with code=0
2026-02-26 02:51:18,361 - infinimetrics.dispatcher - INFO - Summary saved to summary_output/dispatcher_summary_20260226_025118.json

============================================================
Test Summary
============================================================
Total tests:   1
Successful:    1
Failed:        0
Success rate:  100.0%
============================================================


==========================================
[OK] Test completed: InfiniMetrics
Time: 20260226_025119
==========================================
  • comm
(megatron) sunjinge@server:~/InfiniMetrics$ ./scripts/run_tests.sh test_comm.json
==========================================
InfiniMetrics Test Runner
==========================================

Environment: INFINI_ROOT=/home/sunjinge/.infini
NCCL_ROOT=/home/sunjinge/.infini/nccl
CUDA_VISIBLE_DEVICES=not set


==========================================
Running: InfiniMetrics
Time: 20260226_025208
==========================================
Running 1 input path(s):
  - test_comm.json

2026-02-26 02:52:08,405 - infinimetrics.utils.input_loader - INFO - Loaded 1 input(s) from test_comm.json
2026-02-26 02:52:08,405 - infinimetrics.dispatcher - INFO - Processing 1 valid inputs (skipped 0 invalid)
2026-02-26 02:52:08,415 - infinimetrics.dispatcher - INFO - Validation complete: 1 valid, 0 skipped
2026-02-26 02:52:08,415 - infinimetrics.dispatcher - INFO - [1/1] Executing comm.NcclTest.AllReduce
2026-02-26 02:52:08,415 - infinimetrics.executor - INFO - Executor: Running comm.NcclTest.AllReduce
2026-02-26 02:52:08,416 - infinimetrics.communication.nccl_adapter - INFO - Set CUDA_VISIBLE_DEVICES=0,1
2026-02-26 02:52:08,416 - infinimetrics.communication.nccl_adapter - INFO - NCCL tests found at: /home/sunjinge/InfiniMetrics/submodules/nccl-tests
2026-02-26 02:52:11,508 - infinimetrics.communication.nccl_adapter - INFO - NCCL adapter teardown complete
2026-02-26 02:52:11,510 - infinimetrics.executor - INFO - Executor: comm.NcclTest.AllReduce completed with code=0
2026-02-26 02:52:11,510 - infinimetrics.dispatcher - INFO - Summary saved to summary_output/dispatcher_summary_20260226_025211.json

============================================================
Test Summary
============================================================
Total tests:   1
Successful:    1
Failed:        0
Success rate:  100.0%
============================================================


==========================================
[OK] Test completed: InfiniMetrics
Time: 20260226_025211
==========================================
  • install test
(megatron) sunjinge@server:~/InfiniMetrics$ ./scripts/common/install_deps.sh inference
==========================================
InfiniMetrics Dependency Manager
==========================================


==========================================
Inference Frameworks
==========================================

Installing vLLM...
[OK] vLLM already installed

Installing InfiniLM...
[OK] InfiniLM already installed

==========================================
[OK] All operations completed successfully!
==========================================
(megatron) sunjinge@server:~/InfiniMetrics$ ./scripts/common/install_deps.sh comm
==========================================
InfiniMetrics Dependency Manager
==========================================


==========================================
NCCL Tests (Communication)
==========================================

Checking dependencies...
  CUDA... [OK] (nvcc 13.0)

Checking NCCL tests...
[OK] NCCL tests already built

==========================================
[OK] All operations completed successfully!
==========================================

@zzhfz zzhfz requested a review from baominghelly February 26, 2026 06:11
export LD_LIBRARY_PATH="$INFINI_ROOT/lib:$LD_LIBRARY_PATH"
export NCCL_ROOT="$INFINI_ROOT/nccl"
export PATH="$NCCL_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$NCCL_ROOT/lib:$LD_LIBRARY_PATH"
Contributor commented:

This NCCL_ROOT assumption isn't appropriate. Users won't install NCCL under INFINI_ROOT, so these lines can be removed.
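
The reviewer asks for these lines to be deleted outright. Purely as an illustration of the concern, the sketch below shows one alternative that avoids guessing a location: it touches PATH and LD_LIBRARY_PATH only when the user has already set NCCL_ROOT. The function name `setup_nccl_env` is hypothetical, and this is not what the review requests.

```shell
#!/usr/bin/env bash
# Hypothetical alternative to hard-coding NCCL_ROOT under INFINI_ROOT.
# setup_nccl_env is an illustrative name, not part of the PR.

setup_nccl_env() {
    # Do nothing unless the user explicitly provided NCCL_ROOT.
    if [ -n "${NCCL_ROOT:-}" ]; then
        export PATH="$NCCL_ROOT/bin:$PATH"
        # Avoid a trailing ':' when LD_LIBRARY_PATH was previously empty.
        export LD_LIBRARY_PATH="$NCCL_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    fi
}
```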

}

# Check NCCL
check_nccl() {
Contributor commented:

I think this part can be removed as well. NCCL should be a baseline requirement, so we can assume users already have it installed. Besides, a proper check would have to look for the header files first and then the shared libraries, which is rather cumbersome.
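
To make the "header first, then shared library" order concrete, here is a rough sketch of what such a check would involve. It assumes a conventional install prefix containing `include/nccl.h` and `lib/` or `lib64/` with `libnccl.so*`; the function name `check_nccl_files` and the return codes are illustrative only, and the reviewer's suggestion is to skip this check entirely.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a two-step NCCL check under a given prefix.
# Returns 0 if both are found, 1 if the header is missing,
# and 2 if the header exists but no shared library is found.

check_nccl_files() {
    local root="$1" lib
    # Step 1: the header must exist.
    if [ ! -f "$root/include/nccl.h" ]; then
        echo "[MISSING] nccl.h under $root/include"
        return 1
    fi
    # Step 2: look for any versioned or unversioned shared library.
    for lib in "$root"/lib/libnccl.so* "$root"/lib64/libnccl.so*; do
        if [ -e "$lib" ]; then
            echo "[OK] header and library found"
            return 0
        fi
    done
    echo "[MISSING] libnccl.so under $root/lib or $root/lib64"
    return 2
}
```

Even this simplified version has to guess the prefix layout, which illustrates why the reviewer considers the check more trouble than it is worth.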


# Check Megatron-LM
check_megatron() {
echo -n " Megatron-LM... "
Contributor commented:

Please add a TODO or FIXME here. We may end up running Megatron directly from source rather than calling its Python package, so this will need to change once that work is merged. I think it can simply be brought in as a submodule, the same way as nccl-tests.
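
If the framework is later vendored as a submodule, the check might test for a populated source directory instead of an importable Python package. The sketch below assumes that approach; `check_source_tree` is a hypothetical helper, and the real layout is still undecided according to the review.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: treat a dependency as present when its source
# directory exists and is non-empty. An uninitialized git submodule
# is an empty directory, so the emptiness test matters.
# TODO: replace with the real check once the integration is merged.

check_source_tree() {
    local name="$1" path="$2"
    printf '  %s... ' "$name"
    if [ -d "$path" ] && [ -n "$(ls -A "$path" 2>/dev/null)" ]; then
        echo "[OK] (source at $path)"
    else
        echo "[MISSING] (expected source at $path)"
        return 1
    fi
}

# Example call, using the path default that appears in this PR's diff:
# check_source_tree "InfiniTrain" "${INFINITRAIN_PATH:-$PROJECT_ROOT/submodules/InfiniTrain}"
```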

# Check InfiniTrain
check_infinitrain() {
echo -n " InfiniTrain... "
if check_python_package infinitrain; then
Contributor commented:

Add a TODO here as well. InfiniTrain probably doesn't have a Python package yet either.

local INFINITRAIN_PATH="${INFINITRAIN_PATH:-$PROJECT_ROOT/submodules/InfiniTrain}"
if [ -d "$INFINITRAIN_PATH" ]; then
cd "$INFINITRAIN_PATH"
if pip install -e .; then
Contributor commented:

Add a TODO here, and replace this with the actual build method later.

;;
comm)
echo "Checking communication dependencies:"
check_nccl
Contributor commented:

Remove this one too.

all)
echo "Checking all dependencies:"
check_cuda
check_nccl
Contributor commented:

And here as well.

# ========================================
# Inference Frameworks
# ========================================
install_inference() {
Collaborator commented:

InfiniCore is a dependency of InfiniLM. Shouldn't we call check_infinicore before installing InfiniLM?
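
A minimal sketch of the ordering this comment suggests: verify InfiniCore first and stop before attempting the InfiniLM install if it is absent. The `check_infinicore` body here is only a stand-in driven by a demo variable (`FAKE_INFINICORE_PRESENT`, hypothetical), since the real function's contents are not shown in this excerpt. `install_infinilm` is likewise an illustrative name.

```shell
#!/usr/bin/env bash
# Stand-in for the PR's check_infinicore; the real logic is not shown
# in this excerpt. FAKE_INFINICORE_PRESENT exists only for this demo.
check_infinicore() {
    [ "${FAKE_INFINICORE_PRESENT:-0}" = "1" ]
}

# Hypothetical install wrapper: refuse to proceed without InfiniCore.
install_infinilm() {
    echo "Installing InfiniLM..."
    if ! check_infinicore; then
        echo "[ERROR] InfiniCore not found; install it before InfiniLM" >&2
        return 1
    fi
    echo "[OK] prerequisites satisfied (actual install step omitted)"
}
```

Failing early gives a clearer message than letting the InfiniLM build fail later on a missing dependency.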

3 participants