
Fix Phoenix H200 scheduling and monitor script segfault#1135

Open
sbryngelson wants to merge 12 commits into MFlowCode:master from sbryngelson:ci-fixes

Conversation


@sbryngelson sbryngelson commented Feb 12, 2026

User description

Summary

  • Switch Phoenix GPU jobs to H200 nodes for faster scheduling
  • Fix a bash segfault in monitor_slurm_job.sh caused by the fractional read -t 0.1 timeout interacting with process-substitution file descriptors (unwind_frame_run: read_builtin: frame not found). Changed to the integer read -t 1.
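As an illustrative sketch of the pattern involved (the loop shape here is assumed, not taken from the script; only the timeout change comes from the PR):

```shell
# Sketch: read from a process-substitution fd with an integer timeout.
# Fractional timeouts (read -t 0.1) triggered a bash internal error on
# some builds; read -t 1 avoids it.
outfile=$(mktemp)
echo "hello from job output" > "$outfile"

exec 3< <(tail -n +1 "$outfile")   # fd 3 backed by process substitution
lines=0
while IFS= read -r -t 1 -u 3 line; do   # integer timeout, as in the fix
    echo "got: $line"
    lines=$((lines + 1))
done
exec 3<&-
echo "drained $lines line(s)"
rm -f "$outfile"
```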

Test plan

  • Phoenix GPU tests pass without segfault in monitor script
  • Phoenix benchmark jobs complete normally

🤖 Generated with Claude Code

PR Type

Bug fix, Enhancement

Description

  • Fix bash segfault in monitor script from fractional read timeout

    • Changed read -t 0.1 and read -t 0.5 to read -t 1 (integer)
    • Resolves unwind_frame_run error with process substitution file descriptors
  • Switch Phoenix GPU jobs to H200 nodes for faster scheduling

    • Updated submit.sh and submit-bench.sh to use H200 GPU resource
    • Increased ntasks-per-node from 4 to 8 for better resource utilization

Diagram Walkthrough

flowchart LR
  A["Monitor Script<br/>Fractional Timeouts"] -->|"Change to<br/>Integer Timeout"| B["Fixed Bash<br/>Segfault"]
  C["Phoenix GPU<br/>Job Config"] -->|"Switch to<br/>H200 Nodes"| D["Faster Job<br/>Scheduling"]
  E["Increase<br/>ntasks-per-node"] -->|"4 to 8"| D

File Walkthrough

Relevant files
Bug fix
monitor_slurm_job.sh
Fix bash segfault from fractional read timeout                     

.github/scripts/monitor_slurm_job.sh

  • Changed read -t 0.1 to read -t 1 in main monitoring loop
  • Changed read -t 0.5 to read -t 1 in output drain loop
  • Fixes bash segfault caused by fractional timeout with process
    substitution
+2/-2     
Enhancement
submit.sh
Switch Phoenix GPU jobs to H200 nodes                                       

.github/workflows/phoenix/submit.sh

  • Replaced GPU partition constraint with H200 GPU resource specification
  • Changed from -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s to
    --gres=gpu:H200:2
  • Increased ntasks-per-node from 4 to 8 for better resource allocation
+2/-3     
submit-bench.sh
Switch benchmark GPU jobs to H200 nodes                                   

.github/workflows/phoenix/submit-bench.sh

  • Replaced GPU constraint with H200 GPU resource specification
  • Changed from -CL40S constraint to --gres=gpu:H200:2
  • Increased ntasks-per-node from 4 to 8 for improved performance
+2/-3     

CodeAnt-AI Description

Harden SLURM monitoring, add targeted test retries, and switch Phoenix GPU jobs to H200

What Changed

  • Monitor script now uses integer timeouts and robust SLURM state checks (squeue then sacct), so it no longer segfaults and will detect terminal job states reliably.
  • Monitor will cancel the SLURM job if the monitor exits abnormally, and prints clearer heartbeat/state messages while streaming job output.
  • Test runner writes failed test UUIDs to tests/failed_uuids.txt; CI will automatically retry up to 5 failed tests by re-running only those tests instead of rerunning the whole suite.
  • Phoenix GPU job submit scripts now request H200 GPUs and increase tasks-per-node from 4 to 8 to improve scheduling and node utilization.
  • Directory removal is retried on transient errors to reduce failures from flaky filesystem operations.

Impact

✅ Fewer orphaned SLURM jobs
✅ Fewer CI flakes by retrying sporadic test failures
✅ Faster Phoenix GPU job scheduling

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

sbryngelson and others added 2 commits February 12, 2026 11:51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
read -t 0.1 (sub-second timeout) in a loop with process substitution
file descriptors triggers a bash internal error (unwind_frame_run:
read_builtin: frame not found) leading to a segfault. Use integer
timeout (read -t 1) instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 12, 2026 16:51

codeant-ai bot commented Feb 12, 2026

CodeAnt AI is reviewing your PR.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. If you're enjoying it, help us grow by sharing.


@qodo-code-review (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

In sbatch_gpu_opts, the new #SBATCH --gres=gpu:H200:2 line lacks the trailing \ used to build the multi-line string. This can truncate the intended sbatch_gpu_opts contents (e.g., dropping the following --ntasks-per-node line) depending on how the string is later expanded/used, and can lead to missing SBATCH directives at submission time.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"
Possible Issue

Same potential string-construction issue in sbatch_gpu_opts: the added #SBATCH --gres=gpu:H200:2 line does not end with \, which may prevent subsequent SBATCH lines from being included in the generated submission script content.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"
Scheduling Risk

Switching from an explicit GPU partition list to only --gres=gpu:H200:2 may fail or schedule unexpectedly on systems where GPU type is partitioned/controlled via -p and/or constraints rather than GRES alone. Confirm the Phoenix Slurm configuration supports selecting H200 nodes purely via GRES and that no required partition/QoS/account flags were implicitly provided before.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"

@codeant-ai codeant-ai bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Feb 12, 2026

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 3 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.


codeant-ai bot commented Feb 12, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • GPU resource & task density change
    The GPU reservation changed to request H200 GPUs and --ntasks-per-node was increased to 8 (from 4).
    This can affect scheduling, partition compatibility and CPU/GPU oversubscription on some clusters.
    Confirm the cluster supports --gres=gpu:H200:2 and that 8 tasks per node is appropriate for the target node type.

  • Read timeout change
    The non-blocking read -t timeouts were changed from fractional values (0.1 / 0.5) to 1 second integers.
    This increases the maximum blocking time of the read operation and therefore can increase latency of job-status
    checks and heartbeats, and may slow detection of transient squeue failures. Consider making the timeout configurable
    or reduce it if faster responsiveness is required.

  • Drain loop blocking delay
    The drain loop now uses read -t 1 and can wait up to 1s for the final read attempt before exiting.
    That can add up to ~1s delay on drain completion. Verify this is acceptable and that the tail process is reliably
    terminated after draining (tail_pid handling).
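A configurable timeout, as the nitpick suggests, could look like this sketch (the MONITOR_READ_TIMEOUT variable name is illustrative, not from the PR):

```shell
# Sketch: default to the integer 1s timeout the PR uses, but allow an
# override via an environment variable (name is illustrative).
READ_TIMEOUT="${MONITOR_READ_TIMEOUT:-1}"

outfile=$(mktemp)
printf 'a\nb\n' > "$outfile"
exec 3< "$outfile"
count=0
while IFS= read -r -t "$READ_TIMEOUT" -u 3 line; do
    count=$((count + 1))
done
exec 3<&-
echo "read $count line(s) with ${READ_TIMEOUT}s timeout"
rm -f "$outfile"
```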


codeant-ai bot commented Feb 12, 2026

CodeAnt AI finished reviewing your PR.


coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Refactors SLURM monitoring to use state-driven polling, improves exit-code extraction and tailing behavior, adds cancellation on abnormal exit; updates GPU SBATCH requests to --gres=gpu:H200:2 with --ntasks-per-node=8; changes bench workflow triggers to PR events; adds CI retry for up to 5 failed tests and persists failed test UUIDs.

Changes

Cohort / File(s) Summary
SLURM monitor script
.github/scripts/monitor_slurm_job.sh
Replaces squeue-retry/backoff with state-driven loop (get_job_state, is_terminal_state), adds scontrol/sacct fallbacks for ExitCode, cancels job on abnormal exit, increases tail read timeout (0.1s→1s) with burst-read cap, drains output with stabilization checks and capped drain.
GPU SBATCH submit scripts
.github/workflows/phoenix/submit-bench.sh, .github/workflows/phoenix/submit.sh
Activates GPU resource requests: replaces prior partition/constraint flags with --gres=gpu:H200:2 and updates --ntasks-per-node=8, adjusting GPU and per-node task configuration.
CI bench workflow
.github/workflows/bench.yml
Switches trigger from workflow_run to PR events (pull_request, pull_request_review), removes workflow_run-dependent PR-info steps, and simplifies concurrency/gating to use PR refs and review-based conditions.
CI test retry & failure persistence
.github/workflows/test.yml, toolchain/mfc/test/test.py
Adds post-run retry that reruns up to 5 failed tests using failed_uuids.txt; test harness now writes/removes failed_uuids.txt to reflect current failures for selective retries.

Sequence Diagram(s)

sequenceDiagram
    participant Monitor as monitor_slurm_job.sh
    participant SlurmCtl as squeue/sacct/scontrol
    participant JobFS as Job output file (filesystem)
    participant Tail as tail process

    Monitor->>SlurmCtl: get_job_state(job_id) (squeue → sacct fallback)
    SlurmCtl-->>Monitor: STATE (PENDING/RUNNING/COMPLETED/UNKNOWN)
    alt STATE PENDING/CONFIGURING
        Monitor->>Monitor: sleep (~10s), continue polling
    else STATE RUNNING/COMPLETING
        Monitor->>Tail: start non-blocking tail (1s timeout, burst cap)
        Tail->>JobFS: read new lines
        JobFS-->>Tail: new output
        Tail-->>Monitor: heartbeat + output
    else STATE TERMINAL
        Monitor->>Tail: stop tail, drain remaining lines (1s timeout, cap)
        Monitor->>SlurmCtl: scontrol show job -> ExitCode (fallback sacct)
        SlurmCtl-->>Monitor: ExitCode or UNKNOWN
        alt ExitCode == 0
            Monitor-->>Monitor: set monitor_success, exit 0
        else
            Monitor->>SlurmCtl: scancel job_id (if needed)
            Monitor-->>Monitor: exit non-zero
        end
    else STATE UNKNOWN
        Monitor->>Monitor: periodic warning, longer sleep, retry
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

size:L

Poem

🐇 I hop through logs with whiskers bright and keen,
squeue and sacct whisper states in fields of green.
Two H200s hum, eight tasks snug on each node,
I stash tiny UUIDs down my burrowed road.
Thump — the monitor settles, output safe and serene.

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (16 files):

⚔️ .github/scripts/monitor_slurm_job.sh (content)
⚔️ .github/workflows/bench.yml (content)
⚔️ .github/workflows/phoenix/submit-bench.sh (content)
⚔️ .github/workflows/phoenix/submit.sh (content)
⚔️ .github/workflows/test.yml (content)
⚔️ .pr_agent.toml (content)
⚔️ .typos.toml (content)
⚔️ CMakeLists.txt (content)
⚔️ docs/Doxyfile.in (content)
⚔️ docs/documentation/case.md (content)
⚔️ docs/documentation/readme.md (content)
⚔️ docs/documentation/references.md (content)
⚔️ docs/documentation/visualization.md (content)
⚔️ toolchain/mfc/lint_docs.py (content)
⚔️ toolchain/mfc/params/definitions.py (content)
⚔️ toolchain/mfc/test/test.py (content)

These conflicts must be resolved before merging into master.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main objectives: fixing a bash segfault in the monitor script and switching Phoenix GPU jobs to H200 nodes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering all required sections with clear summaries, diagrams, file-by-file walkthrough, and detailed testing information.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


Copilot AI left a comment


Pull request overview

Updates the Phoenix (GT) GitHub Actions SLURM submission scripts to target H200 GPU resources for improved scheduling, and adjusts the SLURM job monitor script to avoid a Bash segfault caused by fractional read -t timeouts.

Changes:

  • Update Phoenix GPU sbatch directives to request H200 GPUs and adjust task count per node.
  • Replace fractional read -t timeouts with integer timeouts in the SLURM job monitor script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/phoenix/submit.sh Switch GPU submission options to request H200 GPUs and increase --ntasks-per-node.
.github/workflows/phoenix/submit-bench.sh Same as above for benchmark submissions.
.github/scripts/monitor_slurm_job.sh Use integer read -t timeouts to prevent Bash segfaults during output streaming.

PR MFlowCode#1124 changed bench.yml to use workflow_run (triggered after Test
Suite completes), which broke the approve-to-run flow for fork PRs.
Revert to the original pull_request + pull_request_review triggers
while keeping improvements (frontier_amd matrix, concurrency group,
timeout, run_parallel_benchmarks.sh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Feb 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.03%. Comparing base (0ba6c02) to head (9eed0c6).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1135   +/-   ##
=======================================
  Coverage   44.03%   44.03%           
=======================================
  Files          70       70           
  Lines       20649    20649           
  Branches     2053     2054    +1     
=======================================
  Hits         9093     9093           
  Misses      10368    10368           
  Partials     1188     1188           

☔ View full report in Codecov by Sentry.


Write failed test UUIDs to tests/failed_uuids.txt after a test run.
In CI, if 1-5 tests fail, automatically re-run just those tests.
If 6+ fail, treat it as a real issue and fail immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeant-ai bot commented Feb 12, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added size:M This PR changes 30-99 lines, ignoring generated files and removed size:XS This PR changes 0-9 lines, ignoring generated files labels Feb 12, 2026

codeant-ai bot commented Feb 12, 2026

CodeAnt AI Incremental review completed.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/test.yml">

<violation number="1" location=".github/workflows/test.yml:138">
P2: The initial test run is forced to succeed with `|| true`, so real failures can be masked when `tests/failed_uuids.txt` is missing. Preserve the exit code and fail the job if no retry is performed.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 137-153: The test step currently uses "|| true" after invoking
mfc.sh which hides crashes; remove that and capture the test command's exit
status (run mfc.sh test -v ... and set TEST_EXIT=$? or use if ! ...; then
TEST_EXIT=$? else TEST_EXIT=0) so you can decide later; keep the existing retry
logic that checks tests/failed_uuids.txt (NUM_FAILED, FAILED) and, after that
block, if tests/failed_uuids.txt does not exist and TEST_EXIT is non‑zero then
exit with TEST_EXIT (or otherwise propagate the original non‑zero exit code) to
avoid reporting success on a crash. Ensure you reference the same mfc.sh
invocation and tests/failed_uuids.txt and use TEST_EXIT when making the final
exit decision.
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)

209-216: Race condition: failed_tests is mutated from worker threads without synchronization.

failed_tests is appended to in handle_case (line 505) which runs in worker threads. While CPython's GIL makes list.append atomic, iterating over failed_tests here (line 213) is safe only because sched.sched (line 179) has already joined all workers by this point. This is fine but fragile — a future refactor that moves this block could introduce a bug. A brief comment would help.

.github/workflows/test.yml (1)

144-144: Minor: useless use of cat.

tr '\n' ' ' < tests/failed_uuids.txt is slightly cleaner, though this is purely cosmetic.

sbryngelson and others added 2 commits February 12, 2026 17:40
Don't mask non-zero exit codes when tests crash before writing
failed_uuids.txt. Only suppress the exit code when the file exists
(meaning the test framework ran to completion and we can retry).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace squeue exit-code polling with get_job_state() that parses
the actual state string (squeue + sacct fallback). Never give up on
UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM
jobs on abnormal monitor exit. Include job state in heartbeats.

Incorporates changes from PR MFlowCode#1140.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
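The state-driven polling described in this commit can be sketched as follows (function names come from the PR description; the exact field parsing in the script may differ):

```shell
# Sketch of state-driven polling with an sacct fallback.
get_job_state() {
    local job_id=$1 state=""
    # squeue reports pending/running jobs...
    state=$(squeue -j "$job_id" -h -o '%T' 2>/dev/null | head -n 1) || true
    if [ -n "$state" ]; then
        echo "$state"
        return 0
    fi
    # ...sacct reports finished ones. -X: job allocation only (no steps),
    # -n: no header, -P: pipe-delimited output for reliable parsing.
    state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n 1) || true
    echo "${state:-UNKNOWN}"
}

is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|TIMEOUT|NODE_FAIL|PREEMPTED|OUT_OF_MEMORY|CANCELLED*)
            return 0 ;;
        *)  return 1 ;;
    esac
}
```

When both tools fail (e.g. a transient SLURM outage), the function returns UNKNOWN rather than aborting, matching the "never give up on UNKNOWN" behavior described above.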

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/scripts/monitor_slurm_job.sh">

<violation number="1" location=".github/scripts/monitor_slurm_job.sh:40">
P2: Guard the `squeue` pipeline so transient command failures don't abort the script under `set -euo pipefail`; otherwise a temporary SLURM outage exits the monitor instead of returning "UNKNOWN".</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
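The guard this review asks for can be demonstrated in isolation (the failing command here is a stand-in for a transient squeue outage):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Under `set -euo pipefail`, a failing command in a pipeline aborts the
# whole script. Appending `|| true` turns the failure into an empty
# result, letting the caller fall through to UNKNOWN instead of exiting.
state=$(false | head -n 1) || true   # stand-in for a transient squeue failure
echo "${state:-UNKNOWN}"
```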

Use -n -X -P flags with sacct: -X restricts to job allocation (not
steps), -P gives pipe-delimited output for reliable parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)

162-185: ⚠️ Potential issue | 🟡 Minor

Stale comment: tail is already stopped at this point.

Line 167 says "Wait for output file to finish growing (stabilize) before stopping tail," but tail was already killed on line 164 and the fd closed on line 163. The actual purpose of this block is to wait for NFS to flush before the final cat on line 189.

📝 Suggested comment fix
-# Wait for output file to finish growing (stabilize) before stopping tail
+# Wait for output file to stabilize (e.g. NFS flush) before final cat
🧹 Nitpick comments (1)
.github/scripts/monitor_slurm_job.sh (1)

59-66: Comprehensive terminal state coverage.

Good list. One very minor note: SLURM also defines REVOKED as a terminal state (for federated jobs), but it's extremely rare and unlikely to appear in practice.

With pipefail, a transient squeue failure would exit the script
instead of falling through to return UNKNOWN. Add || true to both
pipelines. Also fix stale comment about tail stopping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeant-ai bot commented Feb 13, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:M This PR changes 30-99 lines, ignoring generated files labels Feb 13, 2026

codeant-ai bot commented Feb 13, 2026

CodeAnt AI Incremental review completed.

shutil.rmtree can fail with "Directory not empty" on networked
filesystems (Lustre) due to metadata propagation delays. Retry
up to 5 times with 1s backoff before raising.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
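A sketch of the retry described in this commit (the function name and retry parameters are illustrative; the toolchain's actual code may differ):

```python
import errno
import os
import shutil
import tempfile
import time

def rmtree_with_retry(path, attempts=5, delay=1.0):
    """Remove a directory tree, retrying on transient 'Directory not empty'
    errors seen on networked filesystems such as Lustre."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.rmtree(path)
            return
        except OSError as exc:
            # ENOTEMPTY/EEXIST can be transient while metadata propagates
            if attempt == attempts or exc.errno not in (errno.ENOTEMPTY, errno.EEXIST):
                raise
            time.sleep(delay)

# Usage sketch
d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "sub"))
rmtree_with_retry(d, delay=0.1)
print(os.path.exists(d))  # -> False
```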

codeant-ai bot commented Feb 13, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 13, 2026

codeant-ai bot commented Feb 13, 2026

CodeAnt AI Incremental review completed.

On self-hosted runners the workspace persists between runs, so a
leftover file could trigger spurious retries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bot review events (pull_request_review) were racing against and
cancelling legitimate push-triggered (pull_request) benchmark runs
via the shared concurrency group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!


Suggestion: To reliably capture the tail process PID for cleanup, start it as a background job and use a named pipe (FIFO) for I/O redirection instead of process substitution. [possible issue, importance: 7]

New proposed code:
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
+fifo="$(mktemp -u)"
+mkfifo "$fifo"
+
+# Start tail as a real background job so $! is the tail PID
+stdbuf -oL -eL tail -f "$output_file" >"$fifo" 2>&1 &
 tail_pid=$!
 
+# Read tail output from FD 3
+exec 3<"$fifo"
+

Comment on lines +210 to +216
failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
if failed_tests:
with open(failed_uuids_path, "w") as f:
for test_info in failed_tests:
f.write(test_info['uuid'] + "\n")
elif os.path.exists(failed_uuids_path):
os.remove(failed_uuids_path)


Suggestion: Improve the robustness of writing the failed_uuids.txt file by using .get() to avoid a KeyError and by writing to a temporary file before atomically replacing the final file. [possible issue, importance: 6]

Suggested change
failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
if failed_tests:
with open(failed_uuids_path, "w") as f:
for test_info in failed_tests:
f.write(test_info['uuid'] + "\n")
elif os.path.exists(failed_uuids_path):
os.remove(failed_uuids_path)
failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
tmp_failed_uuids_path = failed_uuids_path + ".tmp"
if failed_tests:
uuids = [t.get("uuid") for t in failed_tests if isinstance(t, dict) and t.get("uuid")]
with open(tmp_failed_uuids_path, "w") as f:
for uuid in uuids:
f.write(uuid + "\n")
os.replace(tmp_failed_uuids_path, failed_uuids_path)
else:
if os.path.exists(tmp_failed_uuids_path):
os.remove(tmp_failed_uuids_path)
if os.path.exists(failed_uuids_path):
os.remove(failed_uuids_path)

Comment on lines +143 to +157
if [ -f tests/failed_uuids.txt ]; then
NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
if [ "$NUM_FAILED" -le 5 ]; then
FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
echo ""
echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
echo ""
/bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
else
echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
exit 1
fi
elif [ "$TEST_EXIT" -ne 0 ]; then
exit $TEST_EXIT
fi


Suggestion: Add a check to ensure failed_uuids.txt is not empty before attempting to retry tests, preventing the --only flag from being called with no arguments. [possible issue, importance: 7]

Suggested change
if [ -f tests/failed_uuids.txt ]; then
NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
if [ "$NUM_FAILED" -le 5 ]; then
FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
echo ""
echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
echo ""
/bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
else
echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
exit 1
fi
elif [ "$TEST_EXIT" -ne 0 ]; then
exit $TEST_EXIT
fi
if [ -f tests/failed_uuids.txt ]; then
NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
if [ "$NUM_FAILED" -ge 1 ] && [ "$NUM_FAILED" -le 5 ]; then
FAILED=$(tr '\n' ' ' < tests/failed_uuids.txt)
echo ""
echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
echo ""
/bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
elif [ "$NUM_FAILED" -eq 0 ]; then
if [ "$TEST_EXIT" -ne 0 ]; then
exit $TEST_EXIT
fi
else
echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
exit 1
fi
elif [ "$TEST_EXIT" -ne 0 ]; then
exit $TEST_EXIT
fi


Labels

Review effort 3/5 size:L This PR changes 100-499 lines, ignoring generated files

Development

Successfully merging this pull request may close these issues.

1 participant