feat(training): refactor training module to new adapter/executor architecture#28
Open
feat: add Streamlit dashboard for InfiniMetrics
Collaborator
The design doc specifies ... and I see that the JSON has a framework value of unknown. How about not returning this field at all in that case?
Comment on lines +86 to +93
    from infinimetrics.training.frameworks.infinitrain_impl import (
        InfinitrainImpl,
    )

    self.runner = InfinitrainImpl(config, resolved_device_count, self.run_id)
    logger.info(
        f"Created InfiniTrain implementation (placeholder) with {resolved_device_count} devices"
    )
Collaborator
If this is only a placeholder, I think it would be fine to raise NotImplementedError("InfiniTrain implementation is not ready yet") right here and delete infinitrain_impl.
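A minimal sketch of this suggestion, assuming a hypothetical `create_runner` helper and a `framework` string (neither name appears in the diff). The unsupported branch fails fast instead of constructing a placeholder object:

```python
def create_runner(framework: str) -> str:
    """Hypothetical factory; the names here are illustrative, not from the PR."""
    if framework == "megatron":
        # The real code would construct MegatronImpl here; a plain string
        # stands in for it so the sketch stays self-contained.
        return "megatron-runner"
    if framework == "infinitrain":
        # Fail fast rather than returning a placeholder implementation.
        raise NotImplementedError("InfiniTrain implementation is not ready yet")
    raise ValueError(f"Unsupported framework: {framework}")
```

With this shape the placeholder module has no remaining callers, so it can be deleted until the real implementation lands.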
Comment on lines +162 to +167
    except Exception as e:
        logger.error(f"Training failed: {e}", exc_info=True)
        return self._create_error_response(
            str(e), test_input=test_dict, result_code=1
        )
Collaborator
Errors raised in the adapter layer are handled uniformly in the executor layer, so don't swallow the exception here. See the handling logic in inference_adapter for reference.
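A rough sketch of the split described in this comment. `SketchAdapter`, `SketchExecutor`, and the response fields are hypothetical stand-ins, not the project's actual classes. The adapter lets exceptions propagate, and the executor is the single place that converts them into an error response:

```python
import logging

logger = logging.getLogger(__name__)


class SketchAdapter:
    """Hypothetical adapter: no try/except, failures propagate to the caller."""

    def process(self, test_input: dict) -> dict:
        if "config" not in test_input:
            raise ValueError("missing 'config' in test input")
        return {"result_code": 0, "config": test_input["config"]}


class SketchExecutor:
    """Hypothetical executor: the one layer that catches and reports errors."""

    def __init__(self, adapter: SketchAdapter) -> None:
        self.adapter = adapter

    def execute(self, test_input: dict) -> dict:
        try:
            return self.adapter.process(test_input)
        except Exception as e:
            # Centralized handling: log once with the traceback, then
            # build the error response in a single place.
            logger.error("Test failed: %s", e, exc_info=True)
            return {"result_code": 1, "error_msg": str(e)}
```

Keeping the catch in one layer means every adapter reports failures in the same format and tracebacks are logged only once.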
Comment on lines +335 to +351
        for it, val in sorted(metrics["losses_by_iter"].items()):
            f.write(f"{it},{val}\n")

    # Save PPL
    with open(self.ppl_csv, "w") as f:
        f.write("iteration,ppl\n")
        for it, loss in sorted(metrics["losses_by_iter"].items()):
            try:
                ppl = float(math.exp(loss))
            except Exception:
                ppl = float("inf")
            f.write(f"{it},{ppl}\n")

    # Save throughput
    with open(self.throughput_csv, "w") as f:
        f.write("iteration,throughput\n")
        for it, val in sorted(metrics["throughput_by_iter"].items()):
Collaborator
We can't use sort here. We need to preserve chronological order rather than ordering by magnitude.
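One way to keep chronological order, sketched under the assumption that the metrics dicts are filled in the order the log lines were parsed (Python dicts preserve insertion order since 3.7). `write_series` and `safe_ppl` are hypothetical helpers, not functions from the PR:

```python
import csv
import io
import math


def write_series(stream, header, series):
    """Write (iteration, value) rows in insertion order, without sorting."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    for it, val in series.items():  # insertion order == order of arrival
        writer.writerow([it, val])


def safe_ppl(loss: float) -> float:
    """Perplexity as exp(loss); an overflowing loss maps to infinity."""
    try:
        return math.exp(loss)
    except OverflowError:
        return float("inf")


# Example: entries recorded in arrival order stay in that order on disk.
buf = io.StringIO()
write_series(buf, ["iteration", "loss"], {10: 2.5, 2: 3.0})
```

Catching `OverflowError` specifically, instead of a bare `Exception`, keeps unrelated problems such as a non-numeric loss visible.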
    logger = logging.getLogger(__name__)


    class InfinitrainImpl:
Description
Refactor the training module to conform to InfiniMetrics' new adapter architecture design.
Changes
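- TrainingAdapter: inherits from BaseAdapter to implement unified test execution
- MegatronImpl: implements Megatron-LM training support
- InfinitrainImpl: placeholder implementation (to be developed later)
Test Results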