Harden SLURM job monitor against transient squeue failures#1140
sbryngelson wants to merge 3 commits into MFlowCode:master
Conversation
The monitor script used `squeue -j $id &>/dev/null`, which only checks the exit code. When `squeue` itself fails transiently (SLURM daemon overloaded, network hiccup), this is indistinguishable from "job doesn't exist," causing the monitor to give up on jobs that are still PENDING in the queue — leaving orphaned SLURM jobs.

Changes:
- Add `get_job_state()` that parses `squeue` output for the actual state string, with `sacct` fallback for completed/historical jobs
- Never give up on UNKNOWN state (let CI timeout be the backstop)
- Cancel orphaned SLURM jobs on abnormal monitor exit
- Fix fractional `read` timeouts that caused bash segfaults
- Include job state in heartbeat messages for better diagnostics (a rough sketch of the intended loop shape follows below)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
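A rough sketch of the loop shape these changes describe (illustrative only — `heartbeat_interval`, the fd-3 wiring shown here, and the helper calls are assumptions for the sketch, not the exact script):

```bash
#!/usr/bin/env bash
# Sketch only: state-aware monitoring with an integer read timeout and a
# state-annotated heartbeat. Assumes get_job_state() and is_terminal_state()
# (shown later in this review) are defined; names here are illustrative.
set -uo pipefail

job_id="$1"
output_file="$2"
heartbeat_interval=60              # assumed value, not from the PR
last_heartbeat=$(date +%s)

# Stream job output on fd 3 so heartbeat messages can be interleaved.
exec 3< <(tail -f "$output_file" 2>&1)

while true; do
    # Integer timeout: fractional values like 'read -t 0.1' segfaulted some bash builds.
    if IFS= read -r -t 1 -u 3 line; then
        printf '%s\n' "$line"
    fi

    now=$(date +%s)
    if (( now - last_heartbeat >= heartbeat_interval )); then
        state=$(get_job_state "$job_id")
        echo "[monitor] job $job_id state=$state"
        last_heartbeat=$now
        # UNKNOWN is never treated as fatal; the CI-level timeout is the backstop.
        if is_terminal_state "$state"; then
            break
        fi
    fi
done
```

The real script additionally drains remaining output after a terminal state is seen and tears down the tail process in an EXIT trap; the sketch omits that for brevity.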
CodeAnt AI is reviewing your PR.
PR Reviewer Guide 🔍 — Here are some key observations to aid the review process:
📝 Walkthrough: Reworks the SLURM job monitoring script with improved state tracking via new utility functions (`get_job_state()`, `is_terminal_state()`).
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor Script as monitor_slurm_job.sh
participant SLURM as SLURM System
participant File as Output File
Script->>SLURM: get_job_state(job_id)
SLURM-->>Script: state (PENDING/RUNNING/COMPLETED/FAILED/UNKNOWN)
Script->>Script: is_terminal_state(state)?
alt Terminal
Script->>File: check for output file
alt output missing
Script->>SLURM: scancel job_id (cleanup)
Script-->>Script: exit (error)
else output present
Script->>File: tail/stream output (with burst cap)
end
else Non-terminal
Script->>Script: sleep & retry (state-machine)
end
loop Streaming
File->>Script: data (with read timeout)
Script->>SLURM: periodic get_job_state()
Script-->>Script: heartbeat log (state)
alt terminal detected during stream
Script->>Script: note transition, continue draining until stable or hard limit
end
end
Script->>SLURM: scontrol show job job_id (get ExitCode)
alt scontrol returns ExitCode
SLURM-->>Script: ExitCode
else scontrol fails
Script->>SLURM: sacct fallback query
SLURM-->>Script: Exit status
end
    Script->>Script: evaluate ExitCode, set monitor_success, cleanup
```
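For the final exit-code step in the diagram, a hedged sketch of what the scontrol-then-sacct evaluation could look like (assumes `$job_id` is set; variable names and the sacct `ExitCode` query are illustrative, not the script's exact code):

```bash
# Sketch: determine the job's exit status once it is terminal.
# scontrol is tried first; sacct is the fallback for jobs already purged
# from the controller's active view.
job_exit=""
scontrol_output=$(scontrol show job "$job_id" 2>/dev/null || true)
if [ -n "$scontrol_output" ]; then
    # \b keeps DerivedExitCode= from also matching (see the review note below).
    job_exit=$(echo "$scontrol_output" | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || true)
fi
if [ -z "$job_exit" ] && command -v sacct >/dev/null 2>&1; then
    job_exit=$(sacct -j "$job_id" --format=ExitCode --noheader 2>/dev/null | head -n1 | awk '{print $1}' || true)
fi

if [ "$job_exit" = "0:0" ]; then
    monitor_success=1
else
    echo "Job $job_id finished with exit code ${job_exit:-unknown}"
fi
```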
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
No actionable comments were generated in the recent review.
```bash
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```
Suggestion: Modify the is_terminal_state function to handle all terminal job states, including those with a + suffix like COMPLETED+, by stripping the suffix before evaluation. This prevents the script from hanging on job completion. [possible issue, importance: 8]
Suggested change:

```bash
# Check if a state is terminal (job is done, for better or worse)
is_terminal_state() {
    local state_base=${1%+}  # remove trailing '+'
    case "$state_base" in
        COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```
```bash
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
tail_pid=""
```
Suggestion: To prevent a race condition that could orphan the tail process, move the tail_pid="" line from the end of the script into the cleanup function. This ensures the process ID is cleared only after the process is killed. [possible issue, importance: 5]
Suggested change:

```bash
# Close the file descriptor and kill tail
exec 3<&-
kill "${tail_pid}" 2>/dev/null || true
```
```bash
        scancel "$job_id" 2>/dev/null || true
    fi
}
trap cleanup EXIT
```
Suggestion: Modify the trap command to explicitly catch INT and TERM signals in addition to EXIT, ensuring the cleanup function always runs, even when the script is interrupted. [general, importance: 6]
Suggested change:

```bash
trap cleanup EXIT INT TERM
```
Nitpicks 🔍
```bash
    local state

    # Try squeue first (fast, works for active jobs)
    state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
```
Suggestion: With set -euo pipefail enabled, any non-zero exit from the squeue pipeline inside the command substitution will cause the entire script to exit immediately instead of falling back to sacct or returning UNKNOWN, so transient squeue failures will abort the monitor instead of being treated as non-fatal. [logic error]
Severity Level: Critical 🚨
- ❌ Monitor aborts on transient squeue failures, cancelling SLURM jobs.
- ⚠️ CI SLURM-based benchmark jobs become flaky under controller load.
- ⚠️ Sacct/UNKNOWN fallback never reached when squeue briefly fails.

Suggested change:

```diff
-state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
+state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ' || true)
```
Steps of Reproduction ✅
1. Run the monitor script `.github/scripts/monitor_slurm_job.sh` (entry point used by CI,
see file header at `.github/scripts/monitor_slurm_job.sh:1`) with a valid SLURM job id:
`./monitor_slurm_job.sh <job_id> out.log`.
2. The script enables strict mode with `set -euo pipefail` at
`.github/scripts/monitor_slurm_job.sh:5` and later calls `get_job_state "$job_id"` in both
the "wait for job to start" loop (`state=$(get_job_state "$job_id")` around
`.github/scripts/monitor_slurm_job.sh:74`) and the main monitor loop
(`state=$(get_job_state "$job_id")` around `.github/scripts/monitor_slurm_job.sh:132`).
3. During one of these calls, cause a realistic transient SLURM failure so that `squeue -j
"$jid"` returns a non‑zero exit code (e.g., SLURM controller temporarily unreachable or
overloaded as described in the PR summary). This executes the line `state=$(squeue -j
"$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')` at
`.github/scripts/monitor_slurm_job.sh:40` inside `get_job_state`.
4. Because `set -euo pipefail` is active (`pipefail` makes the pipeline's exit status
non‑zero when `squeue` fails, and the variable assignment is a simple command), the
non‑zero status from the `state=$(...)` command at line 40 causes the entire script to
exit immediately instead of continuing to the sacct fallback (`if command -v sacct` at
line 47) or printing `UNKNOWN` at line 55. The `cleanup` trap at
`.github/scripts/monitor_slurm_job.sh:7-19` then runs, sees `monitor_success` still
unset/0 and a non‑empty `$job_id`, and executes `scancel "$job_id"`, cancelling the
still‑valid job and printing "Monitor exiting abnormally — cancelling SLURM job <job_id>".

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** .github/scripts/monitor_slurm_job.sh
**Line:** 40:40
**Comment:**
Logic Error: With `set -euo pipefail` enabled, any non-zero exit from the `squeue` pipeline inside the command substitution will cause the entire script to exit immediately instead of falling back to `sacct` or returning `UNKNOWN`, so transient `squeue` failures will abort the monitor instead of being treated as non-fatal.
Validate the correctness of the flagged issue. If correct, how can I resolve this? If you propose a fix, implement it and please make it concise.
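A standalone illustration of the `set -e` interaction described above (not from the PR):

```bash
#!/usr/bin/env bash
set -euo pipefail

# pipefail makes the pipeline's status non-zero (false fails, head succeeds),
# and because the assignment is a plain simple command, set -e aborts the
# script right here; the fallback code after this line never runs.
state=$(false | head -n1)

echo "never reached: state='$state'"
```

Appending `|| true` inside the command substitution, as suggested, makes the assignment's status 0, so execution continues and an empty result falls through to the sacct/UNKNOWN path.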
```bash
    # Fallback to sacct (works for completed/historical jobs)
    if command -v sacct >/dev/null 2>&1; then
        state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
```
Suggestion: Similarly to the squeue call, the sacct pipeline in get_job_state will, under set -euo pipefail, cause the script to exit on any non-zero sacct failure (e.g., slurmdbd hiccups) instead of returning UNKNOWN, defeating the intended robustness against transient SLURM issues. [logic error]
Severity Level: Critical 🚨
- ❌ Monitor exits on transient sacct failures, cancelling jobs.
- ⚠️ Historical/finished job state resolution becomes fragile and flaky.
- ⚠️ UNKNOWN state path never reached when sacct briefly unavailable.

Suggested change:

```diff
-state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
+state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}' || true)
```
Steps of Reproduction ✅
1. Run `.github/scripts/monitor_slurm_job.sh <job_id> out.log` (script entry point at
`.github/scripts/monitor_slurm_job.sh:1`) under a SLURM setup where accounting (`sacct` /
slurmdbd) can experience transient failures.
2. Allow the monitored job to reach a state where `squeue -j "$job_id"` no longer returns
a state line (e.g., job completed and aged out of the active queue), so `get_job_state` at
`.github/scripts/monitor_slurm_job.sh:35-56` falls through its initial squeue query (lines
39–42) and enters the sacct fallback guarded by `if command -v sacct` at line 47.
3. During a realistic slurmdbd/sacct hiccup, have the sacct pipeline `sacct -j "$jid"
--format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}'` at
`.github/scripts/monitor_slurm_job.sh:48` return a non‑zero exit status (e.g., sacct
cannot contact the accounting daemon and exits with error).
4. With `set -euo pipefail` enabled at line 5, the non‑zero status from this sacct
pipeline (propagated by `pipefail`) causes the `state=$(...)` simple command at line 48 to
fail, triggering `set -e` and exiting the entire script immediately, before it can echo
`UNKNOWN` at line 55. The EXIT trap (`cleanup` at lines 7–17) runs, sees `monitor_success`
still 0 and a non‑empty `job_id`, and calls `scancel "$job_id"`, cancelling the job or at
least misreporting an internal monitor failure instead of returning UNKNOWN state as
designed.

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** .github/scripts/monitor_slurm_job.sh
**Line:** 48:48
**Comment:**
Logic Error: Similarly to the squeue call, the sacct pipeline in get_job_state will, under `set -euo pipefail`, cause the script to exit on any non-zero sacct failure (e.g., slurmdbd hiccups) instead of returning `UNKNOWN`, defeating the intended robustness against transient SLURM issues.
Validate the correctness of the flagged issue. If correct, how can I resolve this? If you propose a fix, implement it and please make it concise.
CodeAnt AI finished reviewing your PR.
Pull request overview
This PR hardens the SLURM job monitor script against transient squeue failures that previously caused false job abandonment. The key improvement is replacing simple exit code checks with robust state parsing that distinguishes between "job doesn't exist" and "squeue temporarily unavailable."
Changes:
- Introduces `get_job_state()`, which parses actual SLURM state strings with `sacct` fallback for completed jobs
- Adds `is_terminal_state()` helper to correctly identify when jobs are truly finished
- Enhances cleanup handler to cancel orphaned SLURM jobs when the monitor exits abnormally
- Fixes fractional `read` timeouts (0.1s → 1s) that caused bash segfaults on some systems
- Improves diagnostics by including job state in heartbeat messages
```bash
*)
    # Terminal state — job finished without creating output
    if is_terminal_state "$state"; then
        echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
        exit 1
    fi
    break
fi
# Exponential backoff
sleep_time=$((2 ** squeue_retries))
echo "Warning: squeue check failed, retrying in ${sleep_time}s..."
sleep $sleep_time
fi
# Unrecognized state, keep waiting
sleep 5
;;
```
In the case where the state is neither PENDING, CONFIGURING, RUNNING, COMPLETING, UNKNOWN, nor a terminal state recognized by is_terminal_state(), the script will sleep and continue waiting indefinitely. This handles new SLURM states that might be introduced in the future. However, consider logging these unrecognized non-terminal states at least once to help diagnose unexpected SLURM behavior (e.g., "Warning: Job in unrecognized state: REQUEUED, continuing to wait...").
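One possible shape for that one-time warning, as a hedged sketch (the `warned_states` tracking variable and helper function are hypothetical, not part of the PR):

```bash
# Sketch: warn only the first time each unrecognized, non-terminal state is seen,
# then keep waiting as the script already does.
warned_states=""

warn_once_for_state() {
    local state="$1"
    # Space-delimited membership check against the states warned about so far.
    if [[ " $warned_states " != *" $state "* ]]; then
        echo "Warning: Job in unrecognized state: $state, continuing to wait..."
        warned_states+=" $state"
    fi
}

# Example: only the first REQUEUED triggers a warning.
warn_once_for_state "REQUEUED"
warn_once_for_state "REQUEUED"
warn_once_for_state "RESIZING"
```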
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)
194-198: ⚠️ Potential issue | 🔴 Critical — Bug: `grep` matches both `ExitCode` and `DerivedExitCode`, causing successful jobs to be reported as failures.

When `scontrol show job` output contains both `ExitCode=X:Y` and `DerivedExitCode=X:Y` fields (standard output), the pattern `ExitCode=[0-9]+:[0-9]+` matches both, since "ExitCode=" is a substring of "DerivedExitCode=". This produces two lines in `exit_code` (e.g., "0:0\n0:0"), and the comparison at line 217 (`[ "$exit_code" != "0:0" ]`) evaluates to true because the multi-line string doesn't equal "0:0", causing the script to incorrectly exit with error code 1 for successful jobs.

Proposed fix: anchor the pattern with a word boundary.

```diff
-exit_code=$(echo "$scontrol_output" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | cut -d= -f2 || echo "")
+exit_code=$(echo "$scontrol_output" | grep -oE '\bExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || echo "")
```

The `\b` word boundary prevents `DerivedExitCode` from matching; `head -n1` adds a safety net against multiple matches.
🧹 Nitpick comments (1)
.github/scripts/monitor_slurm_job.sh (1)
35-56: `sacct` may return sub-step states — consider filtering to the batch step.

`sacct -j "$jid"` returns one row per job step (job, batch, extern, etc.). `head -n1` grabs the first row, which is typically the overall "job" record — but on some SLURM configurations the order isn't guaranteed or an extra blank/sub-step line may appear first. You could pin to the `.batch` step or filter with `--parsable2` for more deterministic output.

That said, the `UNKNOWN` fallback makes this safe in practice.

Possible hardening:

```diff
-state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
+state=$(sacct -j "$jid.batch" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
```
```diff
@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
 tail_pid=$!
```
Suggestion: Replace the use of process substitution (<(...)) and $! with coproc to reliably capture the tail process PID. [possible issue, importance: 7]
New proposed code:

```diff
 # Start tail and redirect its output to file descriptor 3 for multiplexing
 # This allows us to stream tail output while also printing heartbeat messages
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAIL_PROC { stdbuf -oL -eL tail -f "$output_file" 2>&1; }
+exec 3<&"${TAIL_PROC[0]}"
+tail_pid="$TAIL_PROC_PID"
```
+tail_pid="$TAIL_PROC_PID"| get_job_state() { | ||
| local jid="$1" | ||
| local state | ||
|
|
||
| # Try squeue first (fast, works for active jobs) | ||
| state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ') | ||
| if [ -n "$state" ]; then | ||
| echo "$state" | ||
| return | ||
| fi | ||
|
|
||
| # Fallback to sacct (works for completed/historical jobs) | ||
| if command -v sacct >/dev/null 2>&1; then | ||
| state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}') | ||
| if [ -n "$state" ]; then | ||
| echo "$state" | ||
| return | ||
| fi | ||
| fi | ||
|
|
||
| echo "UNKNOWN" | ||
| } |
Suggestion: Improve the sacct command in get_job_state to query the .batch step explicitly and use parsable output flags (-n -X -P) for more reliable job state detection. [possible issue, importance: 8]
Suggested change (only the `sacct` query differs from the function shown above):

```diff
-        state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
+        state=$(sacct -j "${jid}.batch" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1)
```
```diff
@@ -57,14 +111,13 @@ exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
 tail_pid=$!
```
Suggestion: Use coproc to reliably capture the tail process PID and use tail -F instead of tail -f to better handle log file rotation. [possible issue, importance: 7]
New proposed code:

```diff
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAILPROC { stdbuf -oL -eL tail -F "$output_file" 2>&1; }
+exec 3<&"${TAILPROC[0]}"
+tail_pid="${TAILPROC_PID}"
```
+tail_pid="${TAILPROC_PID}"| get_job_state() { | ||
| local jid="$1" | ||
| local state | ||
|
|
||
| # Try squeue first (fast, works for active jobs) | ||
| state=$(squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ') | ||
| if [ -n "$state" ]; then | ||
| echo "$state" | ||
| return | ||
| fi | ||
|
|
||
| # Fallback to sacct (works for completed/historical jobs) | ||
| if command -v sacct >/dev/null 2>&1; then | ||
| state=$(sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}') | ||
| if [ -n "$state" ]; then | ||
| echo "$state" | ||
| return | ||
| fi | ||
| fi | ||
|
|
||
| echo "UNKNOWN" | ||
| } |
There was a problem hiding this comment.
Suggestion: Wrap squeue and sacct calls with a timeout command to prevent the script from hanging if the SLURM controller is unresponsive. [possible issue, importance: 8]
Suggested change:

```bash
get_job_state() {
    local jid="$1"
    local state
    local tcmd=""
    if command -v timeout >/dev/null 2>&1; then
        tcmd="timeout 5"
    fi

    # Try squeue first (fast, works for active jobs)
    state=$($tcmd squeue -j "$jid" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')
    if [ -n "$state" ]; then
        echo "$state"
        return
    fi

    # Fallback to sacct (works for completed/historical jobs)
    if command -v sacct >/dev/null 2>&1; then
        state=$($tcmd sacct -j "$jid" --format=State --noheader 2>/dev/null | head -n1 | awk '{print $1}')
        if [ -n "$state" ]; then
            echo "$state"
            return
        fi
    fi

    echo "UNKNOWN"
}
```
| if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then | ||
| echo "Monitor exiting abnormally — cancelling SLURM job $job_id" | ||
| scancel "$job_id" 2>/dev/null || true | ||
| fi |
Suggestion: Before calling scancel in the cleanup handler, check if the job is in a non-terminal state to avoid accidentally cancelling a different job that has reused the same ID. [security, importance: 9]
| if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then | |
| echo "Monitor exiting abnormally — cancelling SLURM job $job_id" | |
| scancel "$job_id" 2>/dev/null || true | |
| fi | |
| if [ "${monitor_success:-0}" -ne 1 ] && [ -n "${job_id:-}" ]; then | |
| state="$(get_job_state "$job_id")" | |
| if ! is_terminal_state "$state"; then | |
| echo "Monitor exiting abnormally — cancelling SLURM job $job_id" | |
| scancel "$job_id" 2>/dev/null || true | |
| else | |
| echo "Monitor exiting abnormally — job $job_id already terminal ($state), not cancelling" | |
| fi | |
| fi |
Replace squeue exit-code polling with get_job_state() that parses the actual state string (squeue + sacct fallback). Never give up on UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM jobs on abnormal monitor exit. Include job state in heartbeats. Incorporates changes from PR MFlowCode#1140. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report — ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #1140   +/-   ##
=======================================
  Coverage   44.03%   44.03%
=======================================
  Files          70       70
  Lines       20649    20649
  Branches     2053     2054       +1
=======================================
  Hits         9093     9093
  Misses      10368    10368
  Partials     1188     1188
```

☔ View full report in Codecov by Sentry.
User description
Summary
- The monitor script used `squeue -j $id &>/dev/null`, which only checks the exit code. When `squeue` itself fails transiently, this is indistinguishable from "job doesn't exist," causing the monitor to give up on jobs that are still PENDING — leaving orphaned SLURM jobs.
- Add `get_job_state()` that parses `squeue` output for the actual state string, with `sacct` fallback for completed/historical jobs
- Fix fractional `read` timeouts that caused bash segfaults on some systems

Test plan
- `squeue` is transiently unavailable

🤖 Generated with Claude Code
CodeAnt-AI Description
Harden SLURM job monitor: robust state checks, cancel orphaned jobs, improved streaming reliability
What Changed
Impact
✅ Fewer orphaned SLURM jobs left when monitors are killed
✅ Clearer job state shown in heartbeat logs during runs
✅ Fewer monitor crashes and more reliable streaming of job output
Summary by CodeRabbit
Bug Fixes
Chores
Documentation