fix(pathfinder): use CTK canary fallback for header discovery #1731

Merged
rwgk merged 3 commits into NVIDIA:main from cpcloud:issue-1797-cuda-path-in-containers
Mar 6, 2026

Conversation

cpcloud commented Mar 6, 2026

Summary

  • Reuse CTK canary probing in CTK header discovery when site-packages, conda, and CUDA_HOME/CUDA_PATH do not resolve headers.
  • Derive CTK root from a system-loaded cudart path and search CTK include layout from that root, returning found_via="system-ctk-root".
  • Add coverage for fallback success, precedence (CUDA_HOME before canary), and non-fatal canary-miss behavior; update CTK header search-order docs.
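As a rough illustration of the search order in the summary above, the following Python sketch tries CUDA_HOME/CUDA_PATH first and only then falls back to a root derived from a loaded cudart path. The names find_ctk_header_dir, guess_ctk_root, and probe_cudart are hypothetical, the site-packages and conda steps are omitted, and the parent-walking heuristic is an assumption rather than pathfinder's actual derivation. Only the "system-ctk-root" label comes from this PR; the labels returned for the environment variables are placeholders.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional, Tuple


def guess_ctk_root(cudart_path: str) -> Optional[Path]:
    """Hypothetical heuristic: nearest ancestor that has an include/ directory."""
    for parent in Path(cudart_path).parents:
        if (parent / "include").is_dir():
            return parent
    return None


def find_ctk_header_dir(
    header_name: str,
    env: Mapping[str, str],
    probe_cudart: Callable[[], Optional[str]],
) -> Optional[Tuple[Path, str]]:
    """Return (include_dir, found_via), or None if nothing resolves."""
    # Site-packages and conda lookups would come first; omitted in this sketch.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root and (Path(root) / "include" / header_name).is_file():
            return Path(root) / "include", var  # placeholder label

    # Last resort: ask the canary probe where a system cudart was loaded from.
    cudart_path = probe_cudart()
    if cudart_path is None:
        return None  # a canary miss is non-fatal
    root = guess_ctk_root(cudart_path)
    if root is not None and (root / "include" / header_name).is_file():
        return root / "include", "system-ctk-root"
    return None
```

Passing the probe in as a callable keeps the sketch testable without a real CUDA installation; the real code calls its own subprocess-based probe.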

Closes #1707.

Test plan

  • pixi run pytest tests/test_find_nvidia_headers.py tests/test_ctk_root_discovery.py (from cuda_pathfinder/)
  • pixi run test (repo root)

Made with Cursor

Reuse the CTK root canary probe for CTK header lookup when site-packages, conda, and CUDA_HOME/CUDA_PATH are unavailable, avoiding hardcoded default install paths. Add tests for fallback success, search-order precedence, and non-fatal canary miss behavior.

Made-with: Cursor

copy-pr-bot bot commented Mar 6, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


cpcloud commented Mar 6, 2026

/ok to test

cpcloud requested a review from rwgk March 6, 2026 15:50

rwgk left a comment

Nice! After removing the try-except, we could pull the content of #1728 in here, update the release notes to include this PR, and still release today.

    """
    try:
        canary_abs_path = _resolve_system_loaded_abs_path_in_subprocess("cudart")
    except (ChildProcessError, RuntimeError):

rwgk commented Mar 6, 2026

The try-except will mask bugs, we should remove it completely here.

The canary probe feature will become critical for users, therefore we should hold the implementation quality to the same standards as any other pathfinder code, since it's entirely owned by us. The only failure we're expecting is DynamicLibNotFoundError, which is already handled in canary_probe_subprocess.py. Everything else is a bug we want to know about and fix, so that users can have full confidence in the feature.

cpcloud (Contributor Author)

Agreed. Fixing now.

Avoid masking canary subprocess failures during CTK header discovery so probe bugs are visible. Update header-discovery tests so only a None canary result is non-fatal while runtime probe errors are asserted.

Made-with: Cursor
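
The behavior agreed on in this review thread can be sketched as follows: the probe turns the one expected failure into None, and the caller does not wrap the probe in a try-except, so any other exception surfaces as a bug. This is a minimal illustration under assumptions, not the pathfinder code; ExpectedNotFound stands in for DynamicLibNotFoundError, and canary_probe and canary_fallback are hypothetical names.

```python
from typing import Callable, Optional


class ExpectedNotFound(Exception):
    """Stand-in for the single failure the probe is expected to hit."""


def canary_probe(load: Callable[[], str]) -> Optional[str]:
    # Only the expected "library not found" case is converted to None.
    try:
        return load()
    except ExpectedNotFound:
        return None


def canary_fallback(load: Callable[[], str]) -> Optional[str]:
    # No try-except here: unexpected probe errors propagate to the caller.
    path = canary_probe(load)
    if path is None:
        return None  # non-fatal miss; header discovery reports "not found"
    return path
```

With this shape, a test can assert that a None result is tolerated while a RuntimeError raised inside the probe is not swallowed.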

rwgk left a comment

You're faster here than I am with my cupti PR. I'll set this to auto-merge, then I'll update my release notes PR.

rwgk enabled auto-merge (squash) March 6, 2026 16:51

rwgk commented Mar 6, 2026

/ok to test 78be34e

cpcloud commented Mar 6, 2026

/ok to test

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 6, 2026
rwgk merged commit ef9253b into NVIDIA:main Mar 6, 2026
86 checks passed

github-actions bot commented Mar 6, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.

rwgk added a commit that referenced this pull request Mar 6, 2026
* Add Linux support for loading libcupti.so.12 and libcupti.so.13

This commit adds support for finding and loading CUPTI libraries on Linux
through cuda.pathfinder. It implements support for all enumerated installation
methods:

- Site-packages: nvidia/cuda_cupti/lib (CUDA 12) and nvidia/cu13/lib (CUDA 13)
- Conda: $CONDA_PREFIX/lib (colocated with other CUDA libraries)
- CTK via CUDA_HOME: $CUDA_HOME/extras/CUPTI/lib64
- CTK via canary probe: system CTK root discovery (similar to nvvm)

Changes:
- Add 'cupti' to supported library names and SONAMEs
- Add site-packages paths for CUDA 12 and 13
- Add cupti to CTK root canary discoverable libraries
- Update find_nvidia_dynamic_lib to handle extras/CUPTI/lib64 path
- Add logic to distinguish CTK (extras/CUPTI/lib64) vs conda (lib) paths
- Update _find_so_using_lib_dir to support versioned libraries via glob
- Add comprehensive mock tests covering all installation methods
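
The search locations and the glob-based lookup listed in this commit message could be sketched roughly as below. The function names cupti_candidate_dirs and find_versioned_so are hypothetical, the canary-probe step is left out, and returning the first sorted match is an arbitrary choice for this illustration; the directory layouts are the ones named in the commit message.

```python
import glob
import os
from typing import List, Mapping, Optional


def cupti_candidate_dirs(site_packages: str, env: Mapping[str, str]) -> List[str]:
    """Candidate directories, in the order the commit message lists them."""
    dirs = [
        os.path.join(site_packages, "nvidia", "cuda_cupti", "lib"),  # CUDA 12 wheels
        os.path.join(site_packages, "nvidia", "cu13", "lib"),  # CUDA 13 wheels
    ]
    if env.get("CONDA_PREFIX"):
        dirs.append(os.path.join(env["CONDA_PREFIX"], "lib"))
    if env.get("CUDA_HOME"):
        dirs.append(os.path.join(env["CUDA_HOME"], "extras", "CUPTI", "lib64"))
    return dirs


def find_versioned_so(lib_dir: str, stem: str = "libcupti.so") -> Optional[str]:
    """Find a versioned library such as libcupti.so.12 or libcupti.so.13."""
    matches = sorted(glob.glob(os.path.join(lib_dir, stem + ".*")))
    return matches[0] if matches else None
```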

Fixes #1572 (Linux support)

Made-with: Cursor

* Update cupti tests to use new SearchContext-based API

Migrated test_load_nvidia_dynamic_lib_using_mocker.py from the old
_FindNvidiaDynamicLib API to the new descriptor-based SearchContext API.

Changes:
- Replace _FindNvidiaDynamicLib imports with search_steps and load_nvidia_dynamic_lib modules
- Update mocks to use run_find_steps, LOADER, and SearchContext
- Use LIB_DESCRIPTORS to get cupti descriptor
- Update all test functions to work with the new search step architecture

Made-with: Cursor

* Remove unused CTK canary variables from supported_nvidia_libs.py

These variables (_CTK_ROOT_CANARY_ANCHOR_LIBNAMES and
_CTK_ROOT_CANARY_DISCOVERABLE_LIBNAMES) were added in the cupti PR but
are not used in the new descriptor-based architecture. The new code
uses desc.ctk_root_canary_anchor_libnames directly from descriptors.

Made-with: Cursor

* Improve comment for change in LinuxSearchPlatform.find_in_lib_dir()

* Add cupti to cu12, cu13 groups in cuda_pathfinder/pyproject.toml

* Add cuda_cupti to cuda-components in .github/actions/fetch_ctk/action.yml

* Add windows_dlls, site_packages_windows, anchor_rel_dirs_windows for cupti in /descriptor_catalog.py

* test: Refactor cupti mock tests to focus on Conda and error paths

Remove tests covered by real CI:
- Site-packages tests (CUDA 12 and 13) - covered by real CI
- CTK tests (CUDA_HOME and canary probe) - covered by real CI
- Search order tests involving site-packages/CTK - covered by real CI

Keep tests not covered by real CI:
- Conda discovery test - Conda not covered by real CI
- Error path test (not found) - error path not covered
- Conda vs CTK search order test - Conda not covered by real CI

Also remove unused imports and helper functions.

Made-with: Cursor

* Add pathfinder release/1.4.1-notes.rst

* Add PR #1731 to release/1.4.1-notes.rst

Development

Successfully merging this pull request may close these issues.

[FEA]: CuPy fails to auto-detect CUDA when CUDA_PATH is unset in containers