Fix Phoenix H200 scheduling and monitor script segfault #1135
sbryngelson wants to merge 12 commits into MFlowCode:master
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
read -t 0.1 (sub-second timeout) in a loop with process substitution file descriptors triggers a bash internal error (unwind_frame_run: read_builtin: frame not found) leading to a segfault. Use integer timeout (read -t 1) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
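For context, the failing pattern and the fix look roughly like this. This is a minimal sketch: the log file name, loop body, and output handling are illustrative assumptions, not the script verbatim.

```bash
#!/usr/bin/env bash
# Sketch: stream a growing job log via a process-substitution FD.
output_file="job.out"    # hypothetical log path
touch "$output_file"

# FD 3 reads from tail -f running inside a process substitution
exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)

while true; do
    # Integer timeout: a fractional 'read -t 0.1' on an FD backed by
    # process substitution can crash bash itself
    # (unwind_frame_run: read_builtin: frame not found).
    if read -r -t 1 line <&3; then
        printf '%s\n' "$line"    # forward new output
    elif (( $? > 128 )); then
        :                        # timeout: no new output this second
    else
        break                    # EOF: tail exited
    fi
done
```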
CodeAnt AI is reviewing your PR.
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
Nitpicks 🔍
CodeAnt AI finished reviewing your PR.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings and manage reviews with CodeRabbit commands.
📝 Walkthrough

Refactors SLURM monitoring to use state-driven polling, improves exit-code extraction and tailing behavior, and adds cancellation on abnormal exit; updates GPU SBATCH requests to target H200 GPUs.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Monitor as monitor_slurm_job.sh
participant SlurmCtl as squeue/sacct/scontrol
participant JobFS as Job output file (filesystem)
participant Tail as tail process
Monitor->>SlurmCtl: get_job_state(job_id) (squeue → sacct fallback)
SlurmCtl-->>Monitor: STATE (PENDING/RUNNING/COMPLETED/UNKNOWN)
alt STATE PENDING/CONFIGURING
Monitor->>Monitor: sleep (~10s), continue polling
else STATE RUNNING/COMPLETING
Monitor->>Tail: start non-blocking tail (1s timeout, burst cap)
Tail->>JobFS: read new lines
JobFS-->>Tail: new output
Tail-->>Monitor: heartbeat + output
else STATE TERMINAL
Monitor->>Tail: stop tail, drain remaining lines (1s timeout, cap)
Monitor->>SlurmCtl: scontrol show job -> ExitCode (fallback sacct)
SlurmCtl-->>Monitor: ExitCode or UNKNOWN
alt ExitCode == 0
Monitor-->>Monitor: set monitor_success, exit 0
else
Monitor->>SlurmCtl: scancel job_id (if needed)
Monitor-->>Monitor: exit non-zero
end
else STATE UNKNOWN
Monitor->>Monitor: periodic warning, longer sleep, retry
end
```
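To make the terminal branch of the diagram concrete, the ExitCode lookup can be sketched as below (variable names are assumptions; `ExitCode` takes the form `rc:signal` in both scontrol and sacct output):

```bash
JOB_ID="${1:-12345}"    # hypothetical job id argument

# Prefer scontrol while the job record is still in memory, then sacct.
exit_code=$(scontrol show job "$JOB_ID" 2>/dev/null \
    | grep -oE 'ExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2) || true
if [ -z "$exit_code" ]; then
    exit_code=$(sacct -j "$JOB_ID" -n -X -P -o ExitCode 2>/dev/null | head -n1) || true
fi

rc="${exit_code%%:*}"    # part before ':' is the job's return code
if [ -z "$rc" ]; then rc="UNKNOWN"; fi
echo "ExitCode: $rc"
```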
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches

🧪 Generate unit tests (beta)
Pull request overview
Updates the Phoenix (GT) GitHub Actions SLURM submission scripts to target H200 GPU resources for improved scheduling, and adjusts the SLURM job monitor script to avoid a Bash segfault caused by fractional `read -t` timeouts.
Changes:
- Update Phoenix GPU `sbatch` directives to request H200 GPUs and adjust task count per node.
- Replace fractional `read -t` timeouts with integer timeouts in the SLURM job monitor script.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/phoenix/submit.sh` | Switch GPU submission options to request H200 GPUs and increase `--ntasks-per-node`. |
| `.github/workflows/phoenix/submit-bench.sh` | Same as above for benchmark submissions. |
| `.github/scripts/monitor_slurm_job.sh` | Use integer `read -t` timeouts to prevent Bash segfaults during output streaming. |
PR MFlowCode#1124 changed bench.yml to use workflow_run (triggered after Test Suite completes), which broke the approve-to-run flow for fork PRs. Revert to the original pull_request + pull_request_review triggers while keeping improvements (frontier_amd matrix, concurrency group, timeout, run_parallel_benchmarks.sh). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@ Coverage Diff @@
## master #1135 +/- ##
=======================================
Coverage 44.03% 44.03%
=======================================
Files 70 70
Lines 20649 20649
Branches 2053 2054 +1
=======================================
Hits 9093 9093
Misses 10368 10368
Partials 1188 1188
```

☔ View full report in Codecov by Sentry.
Write failed test UUIDs to tests/failed_uuids.txt after a test run. In CI, if 1-5 tests fail, automatically re-run just those tests. If 6+ fail, treat it as a real issue and fail immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI is running an incremental review.
CodeAnt AI Incremental review completed.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/test.yml">
<violation number="1" location=".github/workflows/test.yml:138">
P2: The initial test run is forced to succeed with `|| true`, so real failures can be masked when `tests/failed_uuids.txt` is missing. Preserve the exit code and fail the job if no retry is performed.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around lines 137-153: The test step currently appends `|| true` to the mfc.sh invocation, which hides crashes. Remove it and capture the test command's exit status (e.g. run `mfc.sh test -v ...` and set `TEST_EXIT=$?`) so the outcome can be evaluated later. Keep the existing retry logic that checks tests/failed_uuids.txt (NUM_FAILED, FAILED); after that block, if tests/failed_uuids.txt does not exist and TEST_EXIT is non-zero, exit with TEST_EXIT so a crash is not reported as success.
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)
209-216: Race condition: `failed_tests` is mutated from worker threads without synchronization.

`failed_tests` is appended to in `handle_case` (line 505), which runs in worker threads. While CPython's GIL makes `list.append` atomic, iterating over `failed_tests` here (line 213) is safe only because `sched.sched` (line 179) has already joined all workers by this point. This is fine but fragile — a future refactor that moves this block could introduce a bug. A brief comment would help.

.github/workflows/test.yml (1)

144-144: Minor: useless use of `cat`.

`tr '\n' ' ' < tests/failed_uuids.txt` is slightly cleaner, though this is purely cosmetic.
Don't mask non-zero exit codes when tests crash before writing failed_uuids.txt. Only suppress the exit code when the file exists (meaning the test framework ran to completion and we can retry). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
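A sketch of the resulting step (names follow the workflow snippet quoted further down this page; treat it as illustrative rather than the exact YAML):

```bash
# Run the suite; capture the exit code instead of discarding it.
TEST_EXIT=0
/bin/bash mfc.sh test -v --max-attempts 3 -j "$(nproc)" $TEST_ALL || TEST_EXIT=$?

# If the framework crashed before writing failed_uuids.txt there is
# nothing to retry: propagate the original failure.
if [ ! -f tests/failed_uuids.txt ] && [ "$TEST_EXIT" -ne 0 ]; then
    exit "$TEST_EXIT"
fi
# Otherwise fall through to the retry logic shown later on this page.
```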
Replace squeue exit-code polling with get_job_state() that parses the actual state string (squeue + sacct fallback). Never give up on UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM jobs on abnormal monitor exit. Include job state in heartbeats. Incorporates changes from PR MFlowCode#1140. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/scripts/monitor_slurm_job.sh">
<violation number="1" location=".github/scripts/monitor_slurm_job.sh:40">
P2: Guard the `squeue` pipeline so transient command failures don't abort the script under `set -euo pipefail`; otherwise a temporary SLURM outage exits the monitor instead of returning "UNKNOWN".</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Use -n -X -P flags with sacct: -X restricts to job allocation (not steps), -P gives pipe-delimited output for reliable parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)
162-185: ⚠️ Potential issue | 🟡 Minor

Stale comment: tail is already stopped at this point.

Line 167 says "Wait for output file to finish growing (stabilize) before stopping tail," but tail was already killed on line 164 and the fd closed on line 163. The actual purpose of this block is to wait for NFS to flush before the final `cat` on line 189.

📝 Suggested comment fix

```diff
-# Wait for output file to finish growing (stabilize) before stopping tail
+# Wait for output file to stabilize (e.g. NFS flush) before final cat
```
🧹 Nitpick comments (1)
.github/scripts/monitor_slurm_job.sh (1)
59-66: Comprehensive terminal state coverage.

Good list. One very minor note: SLURM also defines `REVOKED` as a terminal state (for federated jobs), but it's extremely rare and unlikely to appear in practice.
With pipefail, a transient squeue failure would exit the script instead of falling through to return UNKNOWN. Add || true to both pipelines. Also fix stale comment about tail stopping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
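Taken together, these commits suggest a state query along the following lines (the name `get_job_state` appears in the commits; the exact options and parsing here are a sketch, not the verbatim script):

```bash
# Sketch of the state query: squeue first, sacct as fallback.
get_job_state() {
    local job_id="$1" state

    # squeue knows about pending/running jobs; '|| true' guards the
    # pipeline so a transient SLURM outage doesn't kill the monitor
    # under 'set -euo pipefail'.
    state=$(squeue -j "$job_id" -h -o '%T' 2>/dev/null | head -n1) || true
    if [ -n "$state" ]; then
        echo "$state"
        return 0
    fi

    # sacct covers jobs that have left the queue:
    # -n no header, -X allocation only (no job steps), -P pipe-delimited.
    state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1) || true

    # sacct may report e.g. "CANCELLED by 12345"; keep only the first word.
    echo "${state%% *}"
}

JOB_ID="${1:-12345}"    # hypothetical job id argument
state=$(get_job_state "$JOB_ID")
if [ -z "$state" ]; then state="UNKNOWN"; fi
```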
CodeAnt AI is running an incremental review.
CodeAnt AI Incremental review completed.
shutil.rmtree can fail with "Directory not empty" on networked filesystems (Lustre) due to metadata propagation delays. Retry up to 5 times with 1s backoff before raising. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI is running an incremental review.
CodeAnt AI Incremental review completed.
On self-hosted runners the workspace persists between runs, so a leftover file could trigger spurious retries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bot review events (pull_request_review) were racing against and cancelling legitimate push-triggered (pull_request) benchmark runs via the shared concurrency group. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
@@ -57,14 +111,13 @@
exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!
```
Suggestion: To reliably capture the tail process PID for cleanup, start it as a background job and use a named pipe (FIFO) for I/O redirection instead of process substitution. [possible issue, importance: 7]
New proposed code:
```diff
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
+fifo="$(mktemp -u)"
+mkfifo "$fifo"
+
+# Start tail as a real background job so $! is the tail PID
+stdbuf -oL -eL tail -f "$output_file" >"$fifo" 2>&1 &
 tail_pid=$!
+# Read tail output from FD 3
+exec 3<"$fifo"
```
```python
failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
if failed_tests:
    with open(failed_uuids_path, "w") as f:
        for test_info in failed_tests:
            f.write(test_info['uuid'] + "\n")
elif os.path.exists(failed_uuids_path):
    os.remove(failed_uuids_path)
```
Suggestion: Improve the robustness of writing the failed_uuids.txt file by using .get() to avoid a KeyError and by writing to a temporary file before atomically replacing the final file. [possible issue, importance: 6]
```diff
-failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
-if failed_tests:
-    with open(failed_uuids_path, "w") as f:
-        for test_info in failed_tests:
-            f.write(test_info['uuid'] + "\n")
-elif os.path.exists(failed_uuids_path):
-    os.remove(failed_uuids_path)
+failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
+tmp_failed_uuids_path = failed_uuids_path + ".tmp"
+if failed_tests:
+    uuids = [t.get("uuid") for t in failed_tests if isinstance(t, dict) and t.get("uuid")]
+    with open(tmp_failed_uuids_path, "w") as f:
+        for uuid in uuids:
+            f.write(uuid + "\n")
+    os.replace(tmp_failed_uuids_path, failed_uuids_path)
+else:
+    if os.path.exists(tmp_failed_uuids_path):
+        os.remove(tmp_failed_uuids_path)
+    if os.path.exists(failed_uuids_path):
+        os.remove(failed_uuids_path)
```
```bash
if [ -f tests/failed_uuids.txt ]; then
  NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
  if [ "$NUM_FAILED" -le 5 ]; then
    FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
    echo ""
    echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
    echo ""
    /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
  else
    echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
    exit 1
  fi
elif [ "$TEST_EXIT" -ne 0 ]; then
  exit $TEST_EXIT
fi
```
Suggestion: Add a check to ensure failed_uuids.txt is not empty before attempting to retry tests, preventing the --only flag from being called with no arguments. [possible issue, importance: 7]
```diff
 if [ -f tests/failed_uuids.txt ]; then
   NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
-  if [ "$NUM_FAILED" -le 5 ]; then
-    FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
+  if [ "$NUM_FAILED" -ge 1 ] && [ "$NUM_FAILED" -le 5 ]; then
+    FAILED=$(tr '\n' ' ' < tests/failed_uuids.txt)
     echo ""
     echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
     echo ""
     /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
+  elif [ "$NUM_FAILED" -eq 0 ]; then
+    if [ "$TEST_EXIT" -ne 0 ]; then
+      exit $TEST_EXIT
+    fi
   else
     echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
     exit 1
   fi
 elif [ "$TEST_EXIT" -ne 0 ]; then
   exit $TEST_EXIT
 fi
```
User description
Summary
- Fix a bash segfault in `monitor_slurm_job.sh` caused by the fractional `read -t 0.1` timeout interacting with process substitution file descriptors (`unwind_frame_run: read_builtin: frame not found`). Changed to integer `read -t 1`.

Test plan
🤖 Generated with Claude Code
PR Type
Bug fix, Enhancement
Description
- Fix bash segfault in monitor script from fractional read timeout
  - Changed `read -t 0.1` and `read -t 0.5` to `read -t 1` (integer)
- Switch Phoenix GPU jobs to H200 nodes for faster scheduling
Diagram Walkthrough
File Walkthrough
monitor_slurm_job.sh
Fix bash segfault from fractional read timeout

`.github/scripts/monitor_slurm_job.sh`

- Changed `read -t 0.1` to `read -t 1` in the main monitoring loop
- Changed `read -t 0.5` to `read -t 1` in the output drain loop
- Both loops read from a process substitution file descriptor
submit.sh
Switch Phoenix GPU jobs to H200 nodes

`.github/workflows/phoenix/submit.sh`

- Changed `-p gpu-v100,gpu-a100,gpu-h100,gpu-l40s` to `--gres=gpu:H200:2`

submit-bench.sh
Switch benchmark GPU jobs to H200 nodes

`.github/workflows/phoenix/submit-bench.sh`

- Changed `-C L40S` constraint to `--gres=gpu:H200:2`
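For illustration, the resulting GPU request lines would look something like this; only the `--gres=gpu:H200:2` value is taken from this PR, and the task count shown is a hypothetical placeholder:

```bash
#SBATCH --gres=gpu:H200:2      # request two H200 GPUs (from this PR)
#SBATCH --ntasks-per-node=2    # hypothetical: one task per GPU
```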
CodeAnt-AI Description

Harden SLURM monitoring, add targeted test retries, and switch Phoenix GPU jobs to H200
What Changed
Impact
✅ Fewer orphaned SLURM jobs
✅ Fewer CI flakes by retrying sporadic test failures
✅ Faster Phoenix GPU job scheduling

💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.