
feat(training): refactor training module to new adapter/executor architecture#28

Open
zzhfz wants to merge 2 commits into master from feat/training-refactor

Conversation

@zzhfz (Contributor) commented Feb 28, 2026

Description

Refactor the training module to conform to InfiniMetrics' new adapter architecture design.

Changes

  • Add TrainingAdapter, inheriting BaseAdapter, to implement unified test execution
  • MegatronImpl implements Megatron-LM training support
  • InfinitrainImpl placeholder implementation (to be developed later)
  • Update testcase_utils to support training-type testcases
  • Register the training adapter in the dispatcher
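The changes above follow a plain delegation pattern. A minimal self-contained sketch of that pattern; only the names TrainingAdapter, BaseAdapter, and MegatronImpl come from this PR, and every signature here is an assumption for illustration, not the PR's actual code:

```python
class BaseAdapter:
    """Common interface the dispatcher drives for every test type (sketch)."""

    def setup(self, config):
        self.config = config

    def run(self, test_dict):
        raise NotImplementedError

    def teardown(self):
        pass


class TrainingAdapter(BaseAdapter):
    """Delegates execution to a framework-specific runner (e.g. MegatronImpl)."""

    def __init__(self, runner):
        self.runner = runner

    def run(self, test_dict):
        # the executor layer calls run() and turns the result into a summary
        return self.runner.train(test_dict)


class FakeRunner:
    """Stand-in runner used only to exercise the sketch."""

    def train(self, test_dict):
        return {"result_code": 0}
```

With this shape, `TrainingAdapter(FakeRunner()).run({})` returns the runner's result, which the executor then records in the dispatcher summary.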

Test Results

  • Input JSON
{ 
  "run_id": "train.megatron.gpt.20240208_101",
  "testcase": "train.megatron.GPT",
  "config": {
    "output_dir": "./output",
    "framework": "megatron",
    "model": "gpt",
    "megatron_path": "/home/sunjinge/Megatron-LM",
    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0, 1]
    },
    "train_dataset": "mock",
    "warmup_iterations": 2,
    "train_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 0
      },
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "num_layers": 2,
      "hidden_size": 512,
      "num_attention_heads": 8,
      "vocab_size": 128256,
      "max_position_embeddings": 128,
      "precision": "bf16",
      "optimizer": "adam",
      "weight_decay": 0.01,
      "clip_grad": 1.0,
      "beta1": 0.9,
      "beta2": 0.95,
      "lr_scheduler": "cosine",
      "min_lr": 0.0,
      "eval_interval": 100,
      "eval_iters": 10,
      "save_interval": 1000,
      "extra_args": []
    }
  }
}
  • Run output
(megatron) sunjinge@server:~/InfiniMetrics$ python main.py test_training.json 
2026-02-28 03:13:09,512 - infinimetrics.utils.input_loader - INFO - Loaded 1 input(s) from test_training.json
2026-02-28 03:13:09,512 - infinimetrics.dispatcher - INFO - Processing 1 valid inputs (skipped 0 invalid)
2026-02-28 03:13:09,517 - infinimetrics.dispatcher - INFO - Validation complete: 1 valid, 0 skipped
2026-02-28 03:13:09,517 - infinimetrics.dispatcher - INFO - [1/1] Executing train.megatron.GPT
2026-02-28 03:13:09,517 - infinimetrics.executor - INFO - Executor: Running train.megatron.GPT
.......... ............. .......
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO -     evaluate .......................................: (318.84, 318.84)
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - ----------------------------------------------------------------------------------------------------------
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO -  validation loss at iteration 10 on test set | lm loss value: 1.132849E+01 | lm loss PPL: 8.315699E+04 | 
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - ----------------------------------------------------------------------------------------------------------
2026-02-28 02:47:19,911 - infinimetrics.training.frameworks.megatron_impl - INFO - [rank0]:[W228 02:47:19.614357547 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
2026-02-28 02:47:23,065 - infinimetrics.training.frameworks.megatron_impl - INFO - Results saved to output/training
2026-02-28 02:47:23,065 - infinimetrics.training.training_adapter - INFO - Training completed successfully: train.megatron.GPT
2026-02-28 02:47:23,656 - infinimetrics.utils.accelerator_monitor - INFO - nvidia monitoring stopped
2026-02-28 02:47:23,802 - infinimetrics.training.training_adapter - INFO - TrainingAdapter teardown complete
2026-02-28 02:47:23,804 - infinimetrics.executor - INFO - Executor: train.megatron.GPT completed with code=0
2026-02-28 02:47:23,805 - infinimetrics.dispatcher - INFO - Summary saved to summary_output/dispatcher_summary_20260228_024723.json

============================================================
Test Summary
============================================================
Total tests:   1
Successful:    1
Failed:        0
Success rate:  100.0%
============================================================
  • Output JSON
(megatron) sunjinge@server:~/InfiniMetrics$ cat output/train/train.megatron.gpt.20240208_101_results.json
{
  "run_id": "train.megatron.gpt.20240208_101",
  "time": "2026-02-28 03:13:23",
  "testcase": "train.megatron.GPT",
  "success": 0,
  "environment": {
    "cluster_scale": 1,
    "topology": "2x1 ring mesh",
    "cluster": [
      {
        "machine": {
          "cpu_model": "Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz",
          "memory_gb": 2015,
          "accelerators": [
            {
              "model": "NVIDIA A100-SXM4-80GB",
              "count": 2,
              "memory_gb_per_card": 80,
              "driver": "580.105.08",
              "cuda": "13.0",
              "type": "nvidia"
            }
          ]
        },
        "framework": [
          {
            "name": "unknown",
            "version": "unknown"
          }
        ]
      }
    ]
  },
  "result_code": 0,
  "config": {
    "output_dir": "./output",
    "framework": "megatron",
    "model": "gpt",
    "megatron_path": "/home/sunjinge/Megatron-LM",
    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [
        0,
        1
      ]
    },
    "train_dataset": "mock",
    "warmup_iterations": 2,
    "train_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 0
      },
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "num_layers": 2,
      "hidden_size": 512,
      "num_attention_heads": 8,
      "vocab_size": 128256,
      "max_position_embeddings": 128,
      "precision": "bf16",
      "optimizer": "adam",
      "weight_decay": 0.01,
      "clip_grad": 1.0,
      "beta1": 0.9,
      "beta2": 0.95,
      "lr_scheduler": "cosine",
      "min_lr": 0.0,
      "eval_interval": 100,
      "eval_iters": 10,
      "save_interval": 1000,
      "extra_args": []
    }
  },
  "metrics": [
    {
      "name": "train.throughput",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_throughput.csv",
      "unit": "tokens/s/gpu"
    },
    {
      "name": "train.loss",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_loss.csv",
      "unit": ""
    },
    {
      "name": "train.ppl",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_ppl.csv",
      "unit": null
    },
    {
      "name": "train.peak_memory_usage",
      "type": "scalar",
      "value": 37.060547,
      "unit": "GB"
    }
  ],
  "resolved": {
    "nodes": 1,
    "gpus_per_node": 2,
    "device_used": 2,
    "accelerator_type": "nvidia"
  }
}
  • loss.csv sample
(megatron) sunjinge@server:~/InfiniMetrics$ cat output/training/train.megatron.gpt.20240208_101_train_loss.csv 
iteration,loss
1,11.88619
2,11.82143
3,11.74973
4,11.6724
5,11.5742
6,11.50262
7,11.45898
8,11.48349
9,11.63348
10,11.31823
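Perplexity in this output is exp(loss), and the run log above can be cross-checked directly:

```python
import math

# At iteration 10 the log above reports lm loss 1.132849E+01 and
# lm loss PPL 8.315699E+04; perplexity is simply exp(loss).
loss = 11.32849
ppl = math.exp(loss)
assert abs(ppl - 8.315699e4) < 1.0
```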

@baominghelly (Collaborator) commented

The design doc specifies train.megatron.SFT and train.megatron.LoRA, but you wrote train.megatron.GPT; please change this. Also, can both of those cases be supported?

I see the JSON has "framework": "unknown"; how about just not returning this field in that case?

Comment on lines +86 to +93
from infinimetrics.training.frameworks.infinitrain_impl import (
    InfinitrainImpl,
)

self.runner = InfinitrainImpl(config, resolved_device_count, self.run_id)
logger.info(
    f"Created InfiniTrain implementation (placeholder) with {resolved_device_count} devices"
)

If this is only a placeholder, I think simply doing raise NotImplementedError("InfiniTrain implementation is not ready yet") right here also works; delete infinitrain_impl.
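The suggestion above, sketched as a hypothetical factory; `create_runner` and its arguments are illustrative, not code from this PR:

```python
def create_runner(framework, config, device_count, run_id):
    """Fail fast for the unfinished backend instead of keeping a
    placeholder module around (the reviewer's suggestion)."""
    if framework == "infinitrain":
        raise NotImplementedError("InfiniTrain implementation is not ready yet")
    if framework == "megatron":
        # real code would construct MegatronImpl(config, device_count, run_id)
        return ("megatron", device_count)
    raise ValueError(f"unknown framework: {framework!r}")
```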

Comment on lines +162 to +167
except Exception as e:
    logger.error(f"Training failed: {e}", exc_info=True)
    return self._create_error_response(
        str(e), test_input=test_dict, result_code=1
    )


Errors from the adapter layer are handled uniformly at the executor layer, so don't swallow the exception here; see the inference_adapter handling logic for reference.
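The pattern the reviewer is pointing at, as a minimal sketch (`run_step` is a hypothetical name): log with the traceback, then re-raise so the executor layer produces the error response:

```python
import logging

logger = logging.getLogger(__name__)


def run_step(step):
    """Run one training step; log failures but let them propagate so
    the executor layer handles them uniformly."""
    try:
        return step()
    except Exception as e:
        logger.error(f"Training failed: {e}", exc_info=True)
        raise  # re-raise instead of returning an error response here
```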

Comment on lines +335 to +351
    for it, val in sorted(metrics["losses_by_iter"].items()):
        f.write(f"{it},{val}\n")

# Save PPL
with open(self.ppl_csv, "w") as f:
    f.write("iteration,ppl\n")
    for it, loss in sorted(metrics["losses_by_iter"].items()):
        try:
            ppl = float(math.exp(loss))
        except Exception:
            ppl = float("inf")
        f.write(f"{it},{ppl}\n")

# Save throughput
with open(self.throughput_csv, "w") as f:
    f.write("iteration,throughput\n")
    for it, val in sorted(metrics["throughput_by_iter"].items()):

Don't use sort here; we need it to preserve time order, not be arranged by magnitude.
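A small illustration of the pitfall (loss values taken from the CSV above; string iteration keys are an assumption about how the log parser stores them):

```python
losses_by_iter = {}
# iterations arrive in time order as the log is parsed
for it, loss in [("9", 11.63348), ("10", 11.31823)]:
    losses_by_iter[it] = loss

# sorted() on string keys puts "10" before "9": wrong time order
assert sorted(losses_by_iter) == ["10", "9"]
# plain iteration keeps insertion (time) order on Python 3.7+
assert list(losses_by_iter) == ["9", "10"]
```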

logger = logging.getLogger(__name__)


class InfinitrainImpl:

This can be removed for now.

@baominghelly (Collaborator) left a comment

Please revise the code according to the review comments.


3 participants