KEP-5007: DRA Device Binding Conditions beta in 1.36 #5846
Hi @ttsuuubasa. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
kannon92
left a comment
Please make sure to request a PRR review for the beta promotion.
|
/ok-to-test |
- Updated the Production Readiness Review questionnaire and introduced metrics for troubleshooting and operations.
- Addressed review comments from the v1.35 PR kubernetes#5487.
- Added Graduation Criteria for beta.
- Clarify that happy-path device migration is out of scope for beta criteria

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
cbc56b6 to
55e918d
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ttsuuubasa
The full list of commands accepted by this bot can be found here. |
|
@kannon92 |
- Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
- Additional tests are in Testgrid and linked in KEP
- Scheduler supports timeout configuration via command-line argument
- In this use case, the attachment scenario for moving devices between different pools is achieved through re-scheduling triggered by BindingFailureConditions. However, there remains an issue that device migration needs to be implemented using BindingConditions as a happy-path flow. This will be addressed in a separate KEP and will be considered out of scope for the beta-graduation criteria.
+1 for decoupling Binding Conditions from the attachment, but I'm a bit skeptical about whether the problem can be fixed easily (see the discussion in kubernetes/kubernetes#135473 (comment)). So the question is whether the happy path is really good enough and has been proven to work.
@wojtek-t @sanposhiho @macsko WDYT?
Personally, I consider having at least one user of the happy path at the prototype stage (= PR fully implemented and reviewed, but maybe not merged because of release timing) sufficient for beta. But we should have one.
By "happy path" I meant the one we have right now, i.e. without updating the allocation.
The happy path is planned to be implemented in NVIDIA's ComputeDomain case.
My team members are currently working on the implementation. We plan to submit a pull request to NVIDIA's DRA GitHub within the next two or three days.
|
@johnbelamaric |
Other comments:
This PR promotes the DRA Device Binding Conditions feature from alpha to beta for Kubernetes v1.36, with enhancements based on feedback from DRA driver developers and new metrics.
1. Stage Promotion to Beta
- stage: alpha → stage: beta in kep.yaml
- latest-milestone: "v1.35" → "v1.36"
2. Enhanced DRA Driver Developer's Feedback
3. Improved Monitoring & Observability
- scheduler_dra_bindingconditions_allocations_total: tracks scheduling attempts with success/failure/timeout status
- scheduler_dra_bindingconditions_prebind_duration_seconds: measures PreBind phase duration with detailed labels
4. Clarified Feature Scope
NOTE:
I addressed comments and suggestions from @johnbelamaric during the v1.35 review cycle:
KEP-5007: DRA Device Binding Conditions alpha in 1.35 #5487
/wg device-management
/sig scheduling
/cc @pohly @johnbelamaric @dom4ha