17 changes: 8 additions & 9 deletions .github/workflows/bench.yml
@@ -91,47 +91,47 @@ jobs:
flag: p
device: cpu
interface: none
build_script: ""
build_script: "bash .github/workflows/phoenix/build.sh cpu none"
- cluster: phoenix
name: Georgia Tech | Phoenix (NVHPC)
group: phoenix
labels: gt
flag: p
device: gpu
interface: acc
build_script: ""
build_script: "bash .github/workflows/phoenix/build.sh gpu acc"
- cluster: phoenix
name: Georgia Tech | Phoenix (NVHPC)
group: phoenix
labels: gt
flag: p
device: gpu
interface: omp
build_script: ""
build_script: "bash .github/workflows/phoenix/build.sh gpu omp"
- cluster: frontier
name: Oak Ridge | Frontier (CCE)
group: phoenix
labels: frontier
flag: f
device: gpu
interface: acc
build_script: "bash .github/workflows/frontier/build.sh gpu acc bench"
build_script: "bash .github/workflows/frontier/build.sh gpu acc"
- cluster: frontier
name: Oak Ridge | Frontier (CCE)
group: phoenix
labels: frontier
flag: f
device: gpu
interface: omp
build_script: "bash .github/workflows/frontier/build.sh gpu omp bench"
build_script: "bash .github/workflows/frontier/build.sh gpu omp"
- cluster: frontier_amd
name: Oak Ridge | Frontier (AMD)
group: phoenix
labels: frontier
flag: famd
device: gpu
interface: omp
build_script: "bash .github/workflows/frontier_amd/build.sh gpu omp bench"
build_script: "bash .github/workflows/frontier_amd/build.sh gpu omp"
runs-on:
group: ${{ matrix.group }}
labels: ${{ matrix.labels }}
@@ -153,9 +153,8 @@ jobs:
ref: master
path: master

- name: Setup & Build
if: matrix.build_script != ''
run: |
- name: Fetch Dependencies
run: |
(cd pr && ${{ matrix.build_script }}) &
(cd master && ${{ matrix.build_script }}) &
wait %1 && wait %2
Comment on lines +156 to 160
Suggestion: Add a conditional to the Fetch Dependencies step to only run when matrix.build_script is not empty, preventing potential syntax errors. [possible issue, importance: 6]

Suggested change
- name: Fetch Dependencies
run: |
(cd pr && ${{ matrix.build_script }}) &
(cd master && ${{ matrix.build_script }}) &
wait %1 && wait %2
- name: Fetch Dependencies
if: ${{ matrix.build_script != '' }}
run: |
(cd pr && ${{ matrix.build_script }}) &
(cd master && ${{ matrix.build_script }}) &
wait %1 && wait %2

Copilot AI Feb 13, 2026
The syntax wait %1 && wait %2 is invalid in bash. The %N syntax is for csh/tcsh job control, not bash. In bash, you should use wait without arguments to wait for all background jobs, or capture the PIDs and wait on them explicitly. Replace this line with just wait to wait for both background jobs spawned on lines 158-159.

Suggested change
wait %1 && wait %2
wait
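
For illustration, a minimal bash sketch of the explicit-PID alternative the comment mentions; build.sh stands in for ${{ matrix.build_script }} and the variable names are ours, not from the workflow:

#!/bin/bash
# Launch both builds in the background and record their PIDs.
(cd pr && bash build.sh) & pr_pid=$!
(cd master && bash build.sh) & master_pid=$!

# Wait on each PID explicitly, then fail if either build failed.
wait "$pr_pid"; pr_status=$?
wait "$master_pid"; master_status=$?
if [ "$pr_status" -ne 0 ] || [ "$master_status" -ne 0 ]; then
    exit 1
fi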

38 changes: 34 additions & 4 deletions .github/workflows/frontier/bench.sh
High-level Suggestion

The build-with-retry logic is duplicated across four shell scripts. This should be consolidated into a single reusable script or integrated into the Python build tool to improve maintainability. [High-level, importance: 7]

Solution Walkthrough:

Before:

# In .github/workflows/frontier/test.sh
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
    echo "Build attempt $attempt..."
    if ./mfc.sh test --dry-run ...; then
        echo "Build succeeded."
        break
    fi
    if [ $attempt -lt $max_attempts ]; then
        echo "Build failed... retrying..."
        ./mfc.sh clean
        sleep 30
    fi
    attempt=$((attempt + 1))
done
# ... similar logic repeated in 3 other files ...

After:

# In a new script, e.g., .github/workflows/common/retry_build.sh
#!/bin/bash
COMMAND_TO_RUN="$@"
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
    echo "Build attempt $attempt..."
    if eval "$COMMAND_TO_RUN"; then
        echo "Build succeeded."
        exit 0
    fi
    # ... retry logic ...
done
exit 1

# In .github/workflows/frontier/test.sh
source .github/workflows/common/retry_build.sh "./mfc.sh test --dry-run ..."

@@ -1,20 +1,50 @@
#!/bin/bash

n_ranks=12
build_opts=""
device_opts=""
if [ "$job_device" = "gpu" ]; then
gpus=$(rocm-smi --showid | awk '{print $1}' | grep -Eo '[0-9]+' | uniq | tr '\n' ' ')
n_ranks=$(echo "$gpus" | wc -w) # number of GPUs on node
gpu_ids=$(echo "$gpus" | tr ' ' '\n' | tr '\n' ' ' | sed 's/ $//') # GPU IDs from rocm-smi
device_opts+="--gpu"
build_opts+="--gpu"
if [ "$job_interface" = "acc" ]; then
device_opts+=" acc"
build_opts+=" acc"
elif [ "$job_interface" = "omp" ]; then
device_opts+=" mp"
build_opts+=" mp"
fi
device_opts+=" -g $gpu_ids"
device_opts="$build_opts -g $gpu_ids"
fi

# Build case-optimized binaries on compute node (deps already fetched on login node)
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
build_cmd_ok=true
for dir in benchmarks/*/; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
Comment on lines +25 to +30
Suggestion: Guard the benchmarks directory glob so the loop is skipped cleanly when no benchmark directories exist instead of iterating over a literal pattern. [custom_rule]

Severity Level: Minor ⚠️

Suggested change
for dir in benchmarks/*/; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
shopt -s nullglob
for dir in benchmarks/*/; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
shopt -u nullglob
Why it matters? ⭐

The existing code will iterate over the literal pattern "benchmarks/*/" when there are no matches, causing the run command to be invoked with an invalid path and likely fail the build loop. The proposed use of "shopt -s nullglob" directly fixes that by making the glob expand to nothing so the loop is skipped. This addresses a real robustness bug in the shell script; the file has a "#!/bin/bash" shebang so using the bash builtin "shopt" is appropriate. This suggestion targets general shell-glob guarding (the custom_rule), improving correctness without changing project-specific Fortran rules.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** .github/workflows/frontier/bench.sh
**Line:** 25:30
**Comment:**
	*Custom Rule: Guard the benchmarks directory glob so the loop is skipped cleanly when no benchmark directories exist instead of iterating over a literal pattern.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Comment on lines +24 to +30
Suggestion: Use shopt -s nullglob before the benchmark directory loop to correctly handle cases where no matching directories are found, preventing the loop from running on the literal glob pattern. [possible issue, importance: 7]

Suggested change
build_cmd_ok=true
for dir in benchmarks/*/; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
build_cmd_ok=true
shopt -s nullglob
bench_dirs=(benchmarks/*/)
shopt -u nullglob
for dir in "${bench_dirs[@]}"; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
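
For context, a small standalone demo (not part of this PR) of the behavior these suggestions guard against: without nullglob, an unmatched glob is passed through as a literal string.

#!/bin/bash
# Run in an empty scratch directory so benchmarks/*/ matches nothing.
cd "$(mktemp -d)"

for dir in benchmarks/*/; do
    echo "default globbing iterates over the literal pattern: $dir"
done

shopt -s nullglob
for dir in benchmarks/*/; do
    echo "never reached: nullglob expands an unmatched pattern to nothing"
done
shopt -u nullglob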


if [ "$build_cmd_ok" = true ]; then
echo "Build succeeded on attempt $attempt."
break
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning source targets and retrying in 30s..."
./mfc.sh clean -t pre_process simulation post_process syscheck
sleep 30
else
echo "Build failed after $max_attempts attempts."
exit 1
fi
attempt=$((attempt + 1))
done

if [ "$job_device" = "gpu" ]; then
./mfc.sh bench --mem 12 -j $n_ranks -o "$job_slug.yaml" -- -c frontier $device_opts -n $n_ranks
else
41 changes: 4 additions & 37 deletions .github/workflows/frontier/build.sh
@@ -1,11 +1,13 @@
#!/bin/bash

# Fetch dependencies on login node (internet access).
# Source code is built on compute nodes via test.sh / bench.sh.

# Ignore SIGHUP to survive login node session drops
trap '' HUP

job_device=$1
job_interface=$2
run_bench=$3
build_opts=""
if [ "$job_device" = "gpu" ]; then
build_opts+="--gpu"
@@ -18,39 +20,4 @@ fi

. ./mfc.sh load -c f -m g

max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
if [ "$run_bench" == "bench" ]; then
build_cmd_ok=true
for dir in benchmarks/*/; do
dirname=$(basename "$dir")
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
else
if ./mfc.sh test -v -a --dry-run --rdma-mpi -j 8 $build_opts; then
build_cmd_ok=true
else
build_cmd_ok=false
fi
fi

if [ "$build_cmd_ok" = true ]; then
echo "Build succeeded on attempt $attempt."
exit 0
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
./mfc.sh clean
sleep 30
fi
attempt=$((attempt + 1))
done

echo "Build failed after $max_attempts attempts."
exit 1
./mfc.sh build --deps-only -j 8 $build_opts
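
As wired up in the bench.yml matrix above, this slimmed-down script is now invoked without the trailing bench argument, for example:

bash .github/workflows/frontier/build.sh gpu acc
bash .github/workflows/frontier/build.sh gpu omp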
24 changes: 24 additions & 0 deletions .github/workflows/frontier/test.sh
@@ -11,8 +11,32 @@ if [ "$job_device" = "gpu" ]; then
elif [ "$job_interface" = "omp" ]; then
device_opts+=" mp"
fi
else
device_opts+=" --no-gpu"
fi

# Build source code on compute node (deps already fetched on login node)
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
if ./mfc.sh test -v -a --dry-run --rdma-mpi -j 8 $device_opts; then
Suggestion: The dry-run build step always uses $device_opts and never adds --no-gpu for CPU jobs, so on CPU-only runs the build phase may incorrectly configure GPU-enabled tests, causing the build to fail or behave inconsistently with the actual test invocation that explicitly uses --no-gpu. [logic error]

Severity Level: Major ⚠️
- ⚠️ Frontier CPU CI builds use different config than tests.
- ⚠️ Dry-run may not catch CPU-only configuration issues.
- ⚠️ Potential divergence between built artifacts and tested setup.
Suggested change
if ./mfc.sh test -v -a --dry-run --rdma-mpi -j 8 $device_opts; then
build_device_opts="$device_opts"
if [ "$job_device" != "gpu" ]; then
build_device_opts+=" --no-gpu"
fi
if ./mfc.sh test -v -a --dry-run --rdma-mpi -j 8 $build_device_opts; then
Steps of Reproduction ✅
1. In the Frontier CI workflow, run the `.github/workflows/frontier/test.sh` script with
`job_device` set to a non-GPU value (e.g. `cpu`); see the `device_opts` initialisation and
conditional at lines 7–14 where only `job_device="gpu"` adds `--gpu`.

2. Observe the build loop at lines 17–25: for a CPU job, `device_opts` is empty, so the
dry-run build command executed is `./mfc.sh test -v -a --dry-run --rdma-mpi -j 8` (no
`--gpu`, no `--no-gpu`).

3. Later in the same script, for CPU jobs the real test invocation in the `else` branch at
lines 40–42 is `./mfc.sh test -v -a --max-attempts 3 -j 32 --no-gpu -- -c frontier`,
explicitly forcing a CPU-only configuration.

4. This shows that for CPU jobs the build phase validates a different configuration (no
GPU flag) than the one actually used for tests (`--no-gpu`), so a dry-run build can
succeed while the subsequent CPU-only test run fails or behaves differently because the
GPU/CPU selection is not identical between the two invocations.
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** .github/workflows/frontier/test.sh
**Line:** 21:21
**Comment:**
	*Logic Error: The dry-run build step always uses `$device_opts` and never adds `--no-gpu` for CPU jobs, so on CPU-only runs the build phase may incorrectly configure GPU-enabled tests, causing the build to fail or behave inconsistently with the actual test invocation that explicitly uses `--no-gpu`.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

echo "Build succeeded on attempt $attempt."
break
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning source targets and retrying in 30s..."
./mfc.sh clean -t pre_process simulation post_process syscheck
Suggestion: In the retry path, -t is only guaranteed to apply to the first argument, meaning that simulation, post_process, and syscheck may not actually be cleaned and can cause the test build to keep failing with unchanged artifacts. [logic error]

Severity Level: Major ⚠️
- ⚠️ Frontier CI build retries may not clean all targets.
- ⚠️ Stale artifacts can cause repeated Frontier test failures.
Suggested change
./mfc.sh clean -t pre_process simulation post_process syscheck
./mfc.sh clean -t pre_process -t simulation -t post_process -t syscheck
Steps of Reproduction ✅
1. Push a branch that triggers the Frontier CI workflow which calls
`.github/workflows/frontier/test.sh` on the compute node (see `.github/workflows` CI
configuration; this script is the test entrypoint).

2. Ensure there is a prior partial or failed build so that at least one of the targets
`simulation`, `post_process`, or `syscheck` has stale artifacts while `pre_process` is
either clean or non-problematic.

3. During the CI run, the script enters the build loop at
`.github/workflows/frontier/test.sh:21-37` and the first attempt (`./mfc.sh test -v -a
--dry-run --rdma-mpi -j 8 $device_opts` at line 23) fails due to the stale state in e.g.
the `simulation` target.

4. Because `$attempt -lt $max_attempts`, the cleanup branch at lines 28-31 executes:
`./mfc.sh clean -t pre_process simulation post_process syscheck`. With a conventional
single-argument `-t` implementation in `./mfc.sh` (root script, not shown in this diff,
typically using `getopts`), only `pre_process` is actually bound to `-t`; `simulation`,
`post_process`, and `syscheck` are treated as positional arguments and not as additional
`-t` values. On the next loop iteration, the same stale targets remain uncleaned, causing
the build at line 23 to fail again even though the retry logic was intended to clean all
four targets. This pattern repeats until `max_attempts` is exhausted and the CI job fails.
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** .github/workflows/frontier/test.sh
**Line:** 30:30
**Comment:**
	*Logic Error: In the retry path, `-t` is only guaranteed to apply to the first argument, meaning that `simulation`, `post_process`, and `syscheck` may not actually be cleaned and can cause the test build to keep failing with unchanged artifacts.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
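
To illustrate the general concern (a generic getopts sketch, not mfc.sh's actual argument parser, which is not shown in this diff): a conventional single-value -t option consumes only its first argument and leaves the rest as positionals.

#!/bin/bash
# Hypothetical parser: each -t occurrence takes exactly one value.
targets=()
while getopts "t:" opt; do
    case "$opt" in
        t) targets+=("$OPTARG") ;;
    esac
done
shift $((OPTIND - 1))
echo "bound to -t: ${targets[*]}"   # pre_process
echo "left as positionals: $*"      # simulation post_process syscheck

Invoked as ./parse_demo.sh -t pre_process simulation post_process syscheck, only pre_process is bound to -t, which is why the repeated-flag form in the suggested change is the safer spelling.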

sleep 30
else
echo "Build failed after $max_attempts attempts."
exit 1
fi
attempt=$((attempt + 1))
done
Comment on lines 18 to 37
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the clean command implementation
echo "=== Searching for clean in commands.py ==="
fd -t f commands.py | xargs rg -n "clean" 2>/dev/null | head -30

echo ""
echo "=== Searching for clean function/handler ==="
fd -t f "*.py" toolchain/mfc | xargs rg -n "def clean|'clean'" 2>/dev/null | grep -E "(def clean|command|'clean')" | head -30

echo ""
echo "=== Checking mfc.sh script directly ==="
cat -n mfc.sh | grep -A 20 -B 2 "clean"



🏁 Script executed:

#!/bin/bash
# Check where deps are built/installed when using --deps-only
echo "=== Searching for deps-only or dependency build paths ==="
rg -n "deps-only|dependencies" toolchain/mfc/build.py | head -20

echo ""
echo "=== Check CMakeLists or build config for dependency installation ==="
fd -t f "CMakeLists.txt" | xargs grep -n "INSTALL\|PREFIX\|dependencies" 2>/dev/null | head -30

echo ""
echo "=== Look for deps-only implementation in build logic ==="
rg -n "deps.only|deps_only" toolchain/mfc/ | head -20

echo ""
echo "=== Check for dependency paths in build.py ==="
grep -n "build\|deps\|INSTALL" toolchain/mfc/build.py | head -40



🏁 Script executed:

#!/bin/bash
# Check the test.sh context and understand the retry logic
echo "=== Show full test.sh file ==="
cat -n .github/workflows/frontier/test.sh

echo ""
echo "=== Check if there's logic to handle deps on retries ==="
rg -n "deps-only|--deps-only" .github/workflows/frontier/



🏁 Script executed:

#!/bin/bash
# Check for selective clean options or dependency caching logic
echo "=== Check mfc.sh clean options ==="
./mfc.sh clean --help 2>&1 | head -30

echo ""
echo "=== Check if build.py has logic to skip dependency rebuild ==="
rg -n "skip.*dep|already.*built|cache.*dep" toolchain/mfc/build.py

echo ""
echo "=== Check if there's a way to preserve install dir ==="
rg -n "install|preserve|keep" toolchain/mfc/build.py | grep -i "dir\|path" | head -15



🏁 Script executed:

#!/bin/bash
# Look for the logic that skips already-built targets
echo "=== Find the skip logic for already-built targets ==="
rg -B5 -A5 "already built and installed this target" toolchain/mfc/build.py

echo ""
echo "=== Check the actual condition ==="
rg -B3 -A8 "If we have already built" toolchain/mfc/build.py



🏁 Script executed:

#!/bin/bash
# Check what is_configured() checks
echo "=== Find is_configured method ==="
rg -B2 -A8 "def is_configured" toolchain/mfc/build.py

echo ""
echo "=== Understanding the check ==="
rg -B5 -A10 "def is_configured" toolchain/mfc/build.py



🏁 Script executed:

#!/bin/bash
# Check what get_staging_dirpath returns and understand the clean issue
echo "=== Verify what clean removes vs what would be needed ==="
rg -B2 -A4 "def get_staging_dirpath" toolchain/mfc/build.py

echo ""
echo "=== Confirm the full picture: check if clean has any options ==="
cat -n mfc.sh | sed -n '60,66p'

echo ""
echo "=== Check if there's an alternative command for targeted cleanup ==="
rg -n "rm -rf.*build\|clean" mfc.sh



The retry loop with ./mfc.sh clean will break the --deps-only workflow.

./mfc.sh clean (line 30) removes the entire build/ directory, including both build/staging/ and build/install/. When the build is retried, the system checks for CMakeCache.txt in the staging directory to determine if dependencies are already built. Since clean removes this file, the system will attempt to rebuild all dependencies from source—but compute nodes on HPC systems (like Frontier) lack internet access and cannot fetch external repositories. This guarantees the retry will fail.

Replace ./mfc.sh clean with rm -rf "$(pwd)/build/staging" to preserve the pre-fetched dependencies in build/install/.

🤖 Prompt for AI Agents
In @.github/workflows/frontier/test.sh around lines 18 - 37, The retry loop
currently runs "./mfc.sh clean", which deletes the entire build/ directory
(including build/install/) and removes the staging CMakeCache.txt so the
deps-only workflow cannot detect pre-fetched deps on retry; change the cleanup
step in the loop to remove only the staging artifacts (i.e., delete
build/staging) instead of calling "./mfc.sh clean" so build/install and cached
dependencies remain available for the subsequent attempt and the CMakeCache.txt
check still succeeds.
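
A minimal sketch of how that retry cleanup could look inside the existing loop (the build/staging and build/install paths come from the comment above and are not otherwise verified here):

if [ $attempt -lt $max_attempts ]; then
    echo "Build failed on attempt $attempt. Removing staging artifacts and retrying in 30s..."
    # Wipe only staging so pre-fetched dependencies under build/install/ survive the retry.
    rm -rf "$(pwd)/build/staging"
    sleep 30
else
    echo "Build failed after $max_attempts attempts."
    exit 1
fi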


# Run tests
if [ "$job_device" = "gpu" ]; then
./mfc.sh test -v -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
else
38 changes: 34 additions & 4 deletions .github/workflows/frontier_amd/bench.sh
@@ -1,20 +1,50 @@
#!/bin/bash

n_ranks=12
build_opts=""
device_opts=""
if [ "$job_device" = "gpu" ]; then
gpus=$(rocm-smi --showid | awk '{print $1}' | grep -Eo '[0-9]+' | uniq | tr '\n' ' ')
n_ranks=$(echo "$gpus" | wc -w) # number of GPUs on node
gpu_ids=$(echo "$gpus" | tr ' ' '\n' | tr '\n' ' ' | sed 's/ $//') # GPU IDs from rocm-smi
device_opts+="--gpu"
build_opts+="--gpu"
if [ "$job_interface" = "acc" ]; then
device_opts+=" acc"
build_opts+=" acc"
elif [ "$job_interface" = "omp" ]; then
device_opts+=" mp"
build_opts+=" mp"
fi
device_opts+=" -g $gpu_ids"
device_opts="$build_opts -g $gpu_ids"
fi

# Build case-optimized binaries on compute node (deps already fetched on login node)
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
build_cmd_ok=true
for dir in benchmarks/*/; do
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done

if [ "$build_cmd_ok" = true ]; then
echo "Build succeeded on attempt $attempt."
break
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning source targets and retrying in 30s..."
./mfc.sh clean -t pre_process simulation post_process syscheck
sleep 30
else
echo "Build failed after $max_attempts attempts."
exit 1
fi
attempt=$((attempt + 1))
done

if [ "$job_device" = "gpu" ]; then
./mfc.sh bench --mem 12 -j $n_ranks -o "$job_slug.yaml" -- -c frontier_amd $device_opts -n $n_ranks
else
41 changes: 4 additions & 37 deletions .github/workflows/frontier_amd/build.sh
@@ -1,11 +1,13 @@
#!/bin/bash

# Fetch dependencies on login node (internet access).
# Source code is built on compute nodes via test.sh / bench.sh.

# Ignore SIGHUP to survive login node session drops
trap '' HUP

job_device=$1
job_interface=$2
run_bench=$3
build_opts=""
if [ "$job_device" = "gpu" ]; then
build_opts+="--gpu"
@@ -18,39 +20,4 @@ fi

. ./mfc.sh load -c famd -m g

max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
if [ "$run_bench" == "bench" ]; then
build_cmd_ok=true
for dir in benchmarks/*/; do
dirname=$(basename "$dir")
if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
build_cmd_ok=false
break
fi
done
else
if ./mfc.sh test -v -a --dry-run -j 8 $build_opts; then
build_cmd_ok=true
else
build_cmd_ok=false
fi
fi

if [ "$build_cmd_ok" = true ]; then
echo "Build succeeded on attempt $attempt."
exit 0
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
./mfc.sh clean
sleep 30
fi
attempt=$((attempt + 1))
done

echo "Build failed after $max_attempts attempts."
exit 1
./mfc.sh build --deps-only -j 8 $build_opts
24 changes: 24 additions & 0 deletions .github/workflows/frontier_amd/test.sh
@@ -11,8 +11,32 @@ if [ "$job_device" = "gpu" ]; then
elif [ "$job_interface" = "omp" ]; then
device_opts+=" mp"
fi
else
device_opts+=" --no-gpu"
fi

# Build source code on compute node (deps already fetched on login node)
max_attempts=3
attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Build attempt $attempt of $max_attempts..."
if ./mfc.sh test -v -a --dry-run -j 8 $device_opts; then
echo "Build succeeded on attempt $attempt."
break
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning source targets and retrying in 30s..."
./mfc.sh clean -t pre_process simulation post_process syscheck
sleep 30
else
echo "Build failed after $max_attempts attempts."
exit 1
fi
attempt=$((attempt + 1))
done

# Run tests
if [ "$job_device" = "gpu" ]; then
./mfc.sh test -v -a --max-attempts 3 -j $ngpus $device_opts -- -c frontier_amd
else
4 changes: 2 additions & 2 deletions .github/workflows/phoenix/bench.sh
@@ -48,8 +48,8 @@ while [ $attempt -le $max_attempts ]; do
fi

if [ $attempt -lt $max_attempts ]; then
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
./mfc.sh clean
echo "Build failed on attempt $attempt. Cleaning source targets and retrying in 30s..."
./mfc.sh clean -t pre_process simulation post_process syscheck
sleep 30
else
echo "Build failed after $max_attempts attempts."
23 changes: 23 additions & 0 deletions .github/workflows/phoenix/build.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# Fetch dependencies on login node (internet access).
# Source code is built on compute nodes via test.sh / bench.sh.

# Ignore SIGHUP to survive login node session drops
trap '' HUP

job_device=$1
job_interface=$2
build_opts=""
if [ "$job_device" = "gpu" ]; then
build_opts+="--gpu"
if [ "$job_interface" = "acc" ]; then
build_opts+=" acc"
elif [ "$job_interface" = "omp" ]; then
build_opts+=" mp"
fi
fi

. ./mfc.sh load -c p -m $job_device

./mfc.sh build --deps-only -j $(nproc) $build_opts
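
For reference, the matrix entries added to bench.yml above invoke this new script as:

bash .github/workflows/phoenix/build.sh cpu none
bash .github/workflows/phoenix/build.sh gpu acc
bash .github/workflows/phoenix/build.sh gpu omp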