Commit a3cef82

KEP-5075: Update to BETA, Add Troubleshooting and Metrics
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
1 parent 0463ac3 commit a3cef82

3 files changed: +32 -29 lines changed

keps/prod-readiness/sig-scheduling/5075.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 5075
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-scheduling/5075-dra-consumable-capacity/README.md

Lines changed: 25 additions & 25 deletions
@@ -105,7 +105,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
 - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
+- [x] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
@@ -832,8 +832,9 @@ The following functionalities should be covered in E2E tests:

 - Feature Gates are enabled by default.
 - No major outstanding bugs.
-- 1 example of real-world use case.
+- 2 examples of real-world use cases.
 - CNI DRA driver (kubernetes-sigs/cni-dra-driver) can use this feature to manage and limit bandwidth quota.
+- DRA Driver for CPU (kubernetes-sigs/dra-driver-cpu) can use this feature to manage and limit CPU resources.
 - Feedback collected from the community (developers and users) with adjustments provided, implemented and tested.

 ### GA
@@ -1138,7 +1139,7 @@ Pick one more of these and delete the rest.

 - [x] Metrics
 - Metric names:
-- `apiserver_request` with `resource="resourceclaims", subresource="status"`
+- `apiserver_request` with `resource="resourceclaims"`
 - `scheduler_plugin_execution_duration_seconds` with `plugin="DynamicResources"`
 - For state gathering, `extension_point="PreFilter"`
 - For allocation, `extension_point="Filter"`
@@ -1297,36 +1298,31 @@ No.

 ### Troubleshooting

-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
+The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting
+still applies. The only additional failure modes come from version skew
+in the cluster and the troubleshooting steps provided through the link above
+should be sufficient to determine the cause.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable.
+
 ###### What are other known failure modes?

-<!--
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-how can an operator troubleshoot without logging into a master or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
--->
+See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#what-are-other-known-failure-modes.
+
+- kube-scheduler cannot allocate ResourceClaims.
+
+The shared device may not have sufficient capacity to satisfy the request. The log message `Device capacity not enough` and the `capacities` field in the log `Allocating one device` can provide further clues for investigation (requires -v=7 on kube-scheduler).
+
+If the feature is disabled but a ResourceClaim still requests capacity, the scheduler log will report:
+has capacity requests, but the DRAConsumableCapacity feature is disabled. Nevertheless, when using the allocator in stable mode, no logs related to the DRAConsumableCapacity feature will be emitted.
+

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+N/A
+
 ## Implementation History

 <!--
@@ -1352,6 +1348,10 @@ Alpha 1.35:
 - [Fix 134519 - add ShareID to kubelet plugin API PR 134520](https://github.com/kubernetes/kubernetes/pull/134520) has been pushed on 2025-10-10
 - [Increase test coverage PR 134615](https://github.com/kubernetes/kubernetes/pull/134615) has been pushed on 2025-10-15

+Beta 1.36:
+
+- [Promote DRAConsumableCapacity to Beta PR 136611](https://github.com/kubernetes/kubernetes/pull/136611) has been pushed on 2026-01-29
+
 ## Drawbacks

 <!--
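
The troubleshooting text added to README.md in this commit quotes several kube-scheduler log messages. As a rough sketch, assuming those quoted strings appear verbatim as substrings of the scheduler's log lines, a small Python helper could tally how often each one occurs in a captured log. The label names, the function name, and the sample log lines are hypothetical and only illustrate the idea; real log lines will carry different prefixes and fields.

```python
from collections import Counter

# Substrings quoted in the Troubleshooting text of this commit.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "insufficient_capacity": "Device capacity not enough",
    "allocation_attempt": "Allocating one device",
    "feature_disabled": (
        "has capacity requests, but the DRAConsumableCapacity feature is disabled"
    ),
}


def classify_scheduler_log(lines):
    """Count log lines that contain each known troubleshooting message."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Made-up sample lines; the surrounding layout is an assumption.
SAMPLE_LOG = [
    '"Allocating one device" capacities="bandwidth=10G"',
    '"Device capacity not enough" device="nic-0"',
    "claim has capacity requests, but the DRAConsumableCapacity feature is disabled",
    '"Unrelated scheduler message"',
]

counts = classify_scheduler_log(SAMPLE_LOG)
for label in PATTERNS:
    print(label, counts[label])
```

Per the text in the diff, the first two messages are only emitted when kube-scheduler runs with `-v=7`, so a zero count at a lower verbosity does not rule out a capacity shortage.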

keps/sig-scheduling/5075-dra-consumable-capacity/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -24,12 +24,12 @@ see-also:
 - "/keps/sig-scheduling/5007-device-attach-before-pod-scheduled"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.35"
+latest-milestone: "v1.36"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
@@ -48,5 +48,6 @@ feature-gates:
 disable-supported: true

 # The following PRR answers are required at beta release
-metrics: []
-# - my_feature_metric
+metrics:
+- apiserver_request_total{group="resource.k8s.io", version="v1", resource="resourceclaims"}
+- scheduler_plugin_execution_duration_seconds{plugin="DynamicResources"}
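
The `metrics` entries above name `apiserver_request_total` filtered to `group="resource.k8s.io"` and `resource="resourceclaims"`. One way such a metric might be used, sketched here under assumptions, is to compute the share of ResourceClaim requests that returned a 5xx code from a scrape in Prometheus text exposition format. The function names and the sample scrape are hypothetical, the parser assumes sample lines carry no trailing timestamp, and the 5xx threshold is an illustrative choice rather than something the KEP specifies.

```python
import re

# Matches label="value" pairs, allowing escaped characters inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format sample line into (name, labels, value).

    Assumes the line ends with the sample value (no trailing timestamp).
    """
    head, _, value = line.rpartition(" ")
    name, _, rest = head.partition("{")
    return name, dict(LABEL_RE.findall(rest)), float(value)


def resourceclaim_error_ratio(text):
    """Return the fraction of resourceclaims requests that had a 5xx code."""
    total = errors = 0.0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, labels, value = parse_sample(line)
        if name != "apiserver_request_total":
            continue
        if labels.get("group") != "resource.k8s.io":
            continue
        if labels.get("resource") != "resourceclaims":
            continue
        total += value
        if labels.get("code", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


# Made-up scrape output used only to exercise the sketch.
SAMPLE = """
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",group="resource.k8s.io",resource="resourceclaims",verb="GET",version="v1"} 90
apiserver_request_total{code="500",group="resource.k8s.io",resource="resourceclaims",verb="PATCH",version="v1"} 10
apiserver_request_total{code="200",group="",resource="pods",verb="GET",version="v1"} 500
"""

ratio = resourceclaim_error_ratio(SAMPLE)
print(ratio)
```

In a real cluster the same filter would more likely be written as a query against the monitoring system; the standalone parser is used here only so the example runs without external dependencies.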
