CI: ensure artifacts can be fetched successfully #1725

Merged
cpcloud merged 1 commit into NVIDIA:main from leofang:fix_fetching_prev on Mar 6, 2026

@leofang
Member

leofang commented Mar 5, 2026

leofang requested a review from mdboom March 5, 2026 03:00
leofang self-assigned this Mar 5, 2026
leofang added the bug (Something isn't working) and CI/CD (CI/CD infrastructure) labels Mar 5, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang added the to-be-backported label (Trigger the bot to raise a backport PR upon merge) Mar 5, 2026
leofang changed the title from "CI: ensure the backport CI can fetch artifacts successfully" to "CI: ensure artifacts can be fetched successfully" Mar 5, 2026
leofang removed the to-be-backported label (Trigger the bot to raise a backport PR upon merge) Mar 5, 2026
@leofang
Member Author

leofang commented Mar 5, 2026

/ok to test a403f90

@cpcloud cpcloud merged commit cfbda9f into NVIDIA:main Mar 6, 2026
170 of 175 checks passed
@github-actions

github-actions bot commented Mar 6, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk
Collaborator

rwgk commented Mar 6, 2026

I took a quick look at this PR, purely out of curiosity. After Cursor explained what this PR does, I asked:

Hm, wouldn't it be safer to fail if the last run on the backport branch was not successful?

See below for the response triggered by that one question. Does the response make sense?


Initial Question

Asked about PR #1725 (commit cfbda9fd0c), which changed -s completed to -s success in the gh run list calls of three GitHub Actions workflow files.

What the Original Change Did

The PR changed the gh run list command from:

  • -s completed → includes all finished runs (success, failure, cancelled, etc.)
  • -s success → only includes runs that completed successfully

This was done in three workflow files:

  • .github/workflows/build-wheel.yml
  • .github/workflows/test-wheel-linux.yml
  • .github/workflows/test-wheel-windows.yml

The purpose was to ensure backport CI only downloads artifacts from successful runs, not failed ones.
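
To make the difference between the two filters concrete, here is a minimal sketch that simulates the selection in plain bash instead of calling gh. The run history, the run IDs, and the pick_latest helper are all hypothetical and exist only for illustration; the only thing taken from the description above is the filter semantics (completed matches any finished run, success matches only successful runs).

```shell
#!/usr/bin/env bash
# Simulated run history, newest first: "id status conclusion".
# These values are made up for illustration; real data would come from gh.
RUNS=(
  "103 completed failure"
  "102 completed success"
  "101 completed success"
)

# pick_latest FILTER (hypothetical helper): print the ID of the newest run
# matching FILTER, mimicking "gh run list -L 1 -s FILTER".
#   completed -> any finished run, whatever its conclusion
#   success   -> only runs whose conclusion is "success"
#   ""        -> newest run regardless of state
pick_latest() {
  local filter="$1" run id status conclusion
  for run in "${RUNS[@]}"; do
    read -r id status conclusion <<< "$run"
    case "$filter" in
      "")        echo "$id"; return 0 ;;
      completed) [[ "$status" == "completed" ]] && { echo "$id"; return 0; } ;;
      success)   [[ "$conclusion" == "success" ]] && { echo "$id"; return 0; } ;;
    esac
  done
  return 1  # no run matched the filter
}

echo "completed filter picks run: $(pick_latest completed)"
echo "success filter picks run:   $(pick_latest success)"
```

With this made-up history, the completed filter selects run 103 (which failed), while the success filter skips it and selects run 102.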

Follow-Up Discussion

Question: "Wouldn't it be safer to fail if the last run on the backport branch was not successful?"

Rationale: The current approach (-s success) would silently skip over failed runs and use artifacts from an older successful run. This could mask problems on the backport branch. It would be better to explicitly check the most recent run and fail if it's not successful.

Solution Implemented

Updated all three workflow files to:

  1. Get the most recent run regardless of status (removed the -s success filter)
  2. Check if the run is completed (not still in progress)
  3. Check if the conclusion is "success"
  4. Fail with clear error messages if either check fails
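
The four steps above can be sketched as a single bash function. This is an illustrative distillation under stated assumptions, not the code from the workflow files: check_latest_run is a hypothetical name, it takes the run ID, status, and conclusion as plain arguments rather than querying gh, and the distinct return codes are a choice made for this sketch (the workflows simply exit 1 in every failure case).

```shell
#!/usr/bin/env bash
# check_latest_run ID STATUS CONCLUSION (hypothetical helper)
# Returns 0 only if a run exists, has finished, and concluded with "success".
# Return codes 1/2/3 distinguish the failure cases for this sketch only.
check_latest_run() {
  local id="$1" status="$2" conclusion="$3"

  # No run was found at all.
  if [[ -z "$id" ]]; then
    echo "No workflow runs found." >&2
    return 1
  fi

  # The most recent run has not finished yet.
  if [[ "$status" != "completed" ]]; then
    echo "Run ${id} is still ${status}; wait for it to complete." >&2
    return 2
  fi

  # The most recent run finished, but not successfully.
  if [[ "$conclusion" != "success" ]]; then
    echo "Run ${id} did not succeed (conclusion: ${conclusion})." >&2
    return 3
  fi

  echo "Run ${id} is usable."
}

# Example with made-up values: a finished, successful run passes all checks.
check_latest_run 42 completed success
```

Ordering the checks this way means an in-progress run is reported as "not finished" rather than as a failure.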

Benefits

  • Fails fast: If the backport branch has failing CI, backport CI fails immediately with a clear message
  • Prevents masking: Avoids silently using artifacts from an older successful run when the latest run failed
  • Better error messages: Explains what went wrong and what needs to be fixed

Changes Made

Diff: .github/workflows/build-wheel.yml

--- a/.github/workflows/build-wheel.yml
+++ b/.github/workflows/build-wheel.yml
@@ -369,9 +369,26 @@ jobs:
 
           OLD_BRANCH=$(yq '.backport_branch' ci/versions.yml)
           OLD_BASENAME="cuda-bindings-python${PYTHON_VERSION_FORMATTED}-cuda*-${{ inputs.host-platform }}*"
-          LATEST_PRIOR_RUN_ID=$(gh run list -b ${OLD_BRANCH} -L 1 -w "ci.yml" -s success -R NVIDIA/cuda-python --json databaseId | jq '.[]| .databaseId')
+          # Get the most recent run (regardless of status) and verify it's successful
+          LATEST_RUN_DATA=$(gh run list -b ${OLD_BRANCH} -L 1 -w "ci.yml" -R NVIDIA/cuda-python --json databaseId,conclusion,status)
+          LATEST_PRIOR_RUN_ID=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].databaseId // empty')
+          LATEST_RUN_CONCLUSION=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].conclusion // empty')
+          LATEST_RUN_STATUS=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].status // empty')
+          
           if [[ "$LATEST_PRIOR_RUN_ID" == "" ]]; then
-            echo "LATEST_PRIOR_RUN_ID not found!"
+            echo "No workflow runs found on branch ${OLD_BRANCH}!"
+            exit 1
+          fi
+          
+          if [[ "$LATEST_RUN_STATUS" != "completed" ]]; then
+            echo "The most recent CI run on branch ${OLD_BRANCH} is still ${LATEST_RUN_STATUS} (not completed)."
+            echo "Please wait for the CI to complete before running backport CI."
+            exit 1
+          fi
+          
+          if [[ "$LATEST_RUN_CONCLUSION" != "success" ]]; then
+            echo "ERROR: The most recent CI run on branch ${OLD_BRANCH} did not succeed (conclusion: ${LATEST_RUN_CONCLUSION})."
+            echo "Backport CI requires artifacts from a successful run. Please fix the CI failures on ${OLD_BRANCH} first."
             exit 1
           fi

Diff: .github/workflows/test-wheel-linux.yml

--- a/.github/workflows/test-wheel-linux.yml
+++ b/.github/workflows/test-wheel-linux.yml
@@ -151,9 +151,26 @@ jobs:
 
           OLD_BRANCH=${{ needs.compute-matrix.outputs.OLD_BRANCH }}
           OLD_BASENAME="cuda-bindings-python${PYTHON_VERSION_FORMATTED}-cuda*-${{ inputs.host-platform }}*"
-          LATEST_PRIOR_RUN_ID=$(gh run list -b ${OLD_BRANCH} -L 1 -w "ci.yml" -s success -R NVIDIA/cuda-python --json databaseId | jq '.[]| .databaseId')
+          # Get the most recent run (regardless of status) and verify it's successful
+          LATEST_RUN_DATA=$(gh run list -b ${OLD_BRANCH} -L 1 -w "ci.yml" -R NVIDIA/cuda-python --json databaseId,conclusion,status)
+          LATEST_PRIOR_RUN_ID=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].databaseId // empty')
+          LATEST_RUN_CONCLUSION=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].conclusion // empty')
+          LATEST_RUN_STATUS=$(echo "${LATEST_RUN_DATA}" | jq -r '.[0].status // empty')
+          
           if [[ "$LATEST_PRIOR_RUN_ID" == "" ]]; then
-            echo "LATEST_PRIOR_RUN_ID not found!"
+            echo "No workflow runs found on branch ${OLD_BRANCH}!"
+            exit 1
+          fi
+          
+          if [[ "$LATEST_RUN_STATUS" != "completed" ]]; then
+            echo "The most recent CI run on branch ${OLD_BRANCH} is still ${LATEST_RUN_STATUS} (not completed)."
+            echo "Please wait for the CI to complete before running backport CI."
+            exit 1
+          fi
+          
+          if [[ "$LATEST_RUN_CONCLUSION" != "success" ]]; then
+            echo "ERROR: The most recent CI run on branch ${OLD_BRANCH} did not succeed (conclusion: ${LATEST_RUN_CONCLUSION})."
+            echo "Backport CI requires artifacts from a successful run. Please fix the CI failures on ${OLD_BRANCH} first."
             exit 1
           fi

Diff: .github/workflows/test-wheel-windows.yml

--- a/.github/workflows/test-wheel-windows.yml
+++ b/.github/workflows/test-wheel-windows.yml
@@ -137,12 +137,28 @@ jobs:
         run: |
           $OLD_BRANCH = yq '.backport_branch' ci/versions.yml
           $OLD_BASENAME = "cuda-bindings-python${env:PYTHON_VERSION_FORMATTED}-cuda*-${{ inputs.host-platform }}*"
-          $runData = gh run list -b $OLD_BRANCH -L 1 -w "ci.yml" -s success -R NVIDIA/cuda-python --json databaseId | ConvertFrom-Json
+          # Get the most recent run (regardless of status) and verify it's successful
+          $runData = gh run list -b $OLD_BRANCH -L 1 -w "ci.yml" -R NVIDIA/cuda-python --json databaseId,conclusion,status | ConvertFrom-Json
           if (-not $runData -or $runData.Length -eq 0 -or -not $runData[0].databaseId -or [string]::IsNullOrEmpty($runData[0].databaseId)) {
-              Write-Host "LATEST_PRIOR_RUN_ID not found!"
+              Write-Host "No workflow runs found on branch $OLD_BRANCH!"
               exit 1
           }
+          
           $LATEST_PRIOR_RUN_ID = $runData[0].databaseId
+          $LATEST_RUN_STATUS = $runData[0].status
+          $LATEST_RUN_CONCLUSION = $runData[0].conclusion
+          
+          if ($LATEST_RUN_STATUS -ne "completed") {
+              Write-Host "The most recent CI run on branch $OLD_BRANCH is still $LATEST_RUN_STATUS (not completed)."
+              Write-Host "Please wait for the CI to complete before running backport CI."
+              exit 1
+          }
+          
+          if ($LATEST_RUN_CONCLUSION -ne "success") {
+              Write-Host "ERROR: The most recent CI run on branch $OLD_BRANCH did not succeed (conclusion: $LATEST_RUN_CONCLUSION)."
+              Write-Host "Backport CI requires artifacts from a successful run. Please fix the CI failures on $OLD_BRANCH first."
+              exit 1
+          }
 
           gh run download $LATEST_PRIOR_RUN_ID -p $OLD_BASENAME -R NVIDIA/cuda-python

Summary

The improvement ensures that backport CI will fail fast and clearly if the backport branch has failing CI, rather than silently using artifacts from an older successful run. This prevents problems from being masked and provides better visibility into CI health on backport branches.

@leofang leofang deleted the fix_fetching_prev branch March 7, 2026 05:53
Labels

bug (Something isn't working), CI/CD (CI/CD infrastructure)

3 participants