@ttsuuubasa
Contributor

  • One-line PR description: updating KEP docs for promotion to beta
  • Other comments:
    This PR promotes the DRA Device Binding Conditions feature from alpha to beta for Kubernetes v1.36, with enhancements based on feedback from DRA driver developers and on new metrics.

    1. Stage Promotion to Beta

    • Updated stage: alpha → stage: beta in kep.yaml
    • Updated latest-milestone: "v1.35" → "v1.36"

    2. Enhanced DRA Driver Developers' Feedback

    • CoHDI: Added comprehensive feedback from CoHDI (Composable Hardware Device Infrastructure) testing, including scenarios for device pool changes and external controller bug identification
    • NVIDIA DRA Driver: Added feedback from NVIDIA's k8s-dra-driver-gpu showcasing ComputeDomain support with Multi-Node NVLink and IMEX technology

    3. Improved Monitoring & Observability

    • New Metrics: Introduced two new metrics for better operational visibility:
      • scheduler_dra_bindingconditions_allocations_total:
        tracks scheduling attempts with success/failure/timeout status
      • scheduler_dra_bindingconditions_prebind_duration_seconds:
        measures PreBind phase duration with detailed labels
    • Enhanced Detection: Replaced event log-based monitoring with metric-based detection for better automation
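
As a rough illustration of how these two metrics might be consumed, the sketch below records one observation per PreBind attempt and derives a failure ratio that an alert could watch. Only the two metric names come from this PR. The `status` label name, the `BindingConditionsMetrics` class, and the `failure_ratio` helper are hypothetical, and the scheduler itself is written in Go, so this Python models the bookkeeping rather than the implementation.

```python
from collections import defaultdict

# Metric names are taken from the PR description; everything else is illustrative.
ALLOCATIONS_TOTAL = "scheduler_dra_bindingconditions_allocations_total"
PREBIND_DURATION = "scheduler_dra_bindingconditions_prebind_duration_seconds"

# Outcomes listed in the PR description for the allocations counter.
VALID_STATUSES = ("success", "failure", "timeout")


class BindingConditionsMetrics:
    """In-memory stand-in for a metrics registry (hypothetical helper)."""

    def __init__(self):
        self.allocations = defaultdict(int)          # status -> attempt count
        self.prebind_durations = defaultdict(list)   # status -> [seconds, ...]

    def observe(self, status, duration_seconds):
        """Record one PreBind attempt and how long it took."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.allocations[status] += 1
        self.prebind_durations[status].append(duration_seconds)

    def failure_ratio(self):
        """Share of attempts that ended in failure or timeout."""
        total = sum(self.allocations.values())
        if total == 0:
            return 0.0
        bad = self.allocations.get("failure", 0) + self.allocations.get("timeout", 0)
        return bad / total


metrics = BindingConditionsMetrics()
metrics.observe("success", 1.2)
metrics.observe("success", 0.8)
metrics.observe("timeout", 600.0)
metrics.observe("failure", 3.5)
print(metrics.failure_ratio())  # 0.5
```

A ratio like this is the kind of signal that metric-based detection makes easy to automate, whereas the earlier event-log approach would require parsing free-form messages.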

    4. Clarified Feature Scope

    • Added explicit non-goal: device pool migration as happy-path flow (deferred to separate KEP)

    NOTE:
    I addressed comments and suggestions from @johnbelamaric during the v1.35 review cycle:
    KEP-5007: DRA Device Binding Conditions alpha in 1.35 #5487

/wg device-management
/sig scheduling
/cc @pohly @johnbelamaric @dom4ha

@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 28, 2026
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 28, 2026
@k8s-ci-robot
Contributor

Hi @ttsuuubasa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 28, 2026
Contributor

@kannon92 kannon92 left a comment

Please make sure to request a PRR review for the beta promotion.

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Jan 29, 2026
@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 29, 2026
- Updated the Production Readiness Review questionnaire
  and introduced metrics for troubleshooting and operations.
- Addressed review comments from the v1.35 PR kubernetes#5487.
- Added Graduation Criteria for beta.
- Clarified that happy-path device migration is out of scope for the beta criteria.

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
@ttsuuubasa ttsuuubasa force-pushed the dra-device-binding-conditions branch from cbc56b6 to 55e918d Compare January 29, 2026 06:54
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ttsuuubasa
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho, soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ttsuuubasa
Contributor Author

@kannon92
I received an ack from @johnbelamaric to serve as the PRR approver, and updated prod-readiness/sig-scheduling/5007.yaml accordingly.
Please let me know if this is the correct procedure.

- Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
- Additional tests are in Testgrid and linked in KEP
- Scheduler supports timeout configuration via command-line argument
- In this use case, the attachment scenario for moving devices between different pools is achieved through re-scheduling triggered by BindingFailureConditions. However, there remains an issue that device migration needs to be implemented using BindingConditions as a happy‑path flow. This will be addressed in a separate KEP and will be considered out of scope for the beta-graduation criteria.
Member

+1 for decoupling Binding Conditions from the attachment, but I'm a bit skeptical about whether the problem can be fixed easily (see discussion in kubernetes/kubernetes#135473 (comment)). So the question is whether the happy path is really good enough and has been proven to work.

@wojtek-t @sanposhiho @macsko WDYT?

Contributor

Personally, I consider having at least one user of the happy path at the prototype stage (= PR fully implemented and reviewed, but maybe not merged because of release timing) sufficient for beta. But we should have one.

Contributor

By "happy path" I meant the one we have right now, i.e. without updating the allocation.

Contributor

The happy path is planned to be implemented in NVIDIA's ComputeDomain case.
My team members are currently working on the implementation. We plan to submit a pull request to NVIDIA's DRA driver repository on GitHub within the next two or three days.

@ttsuuubasa ttsuuubasa mentioned this pull request Feb 3, 2026
20 tasks
@ttsuuubasa
Contributor Author

@johnbelamaric
We have agreement from @pohly and @dom4ha that the happy path for binding conditions specifically in the device‑attachment scenario is out of scope for this KEP, and that the feature can still proceed to beta graduation.
Given that, I’d like to hear your opinion as well.

Projects

Status: 👀 In review
Status: Needs Review

6 participants