[WIP] KEP-4671: Beta Promotion #5854
Conversation
mm4tt commented on Jan 30, 2026:
- One-line PR description: Promoting KEP-4671 Gang Scheduling to Beta in 1.36
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments:
Skipping CI for Draft Pull Request.
> specific group.
> - Events: Repeated `FailedScheduling` events on the Pods with workloadRef.
> - Mitigations: If the gang cannot fit due to resource constraints, delete the Workload object which should disable
>   the gang-scheduling TODO(mm4tt@): Discuss with Wojtek
@wojtek-t this is related to the opt-out discussion we need to have. PTAL and let me know your thoughts
> - `scheduler_pod_group_scheduling_attempts_total`
> - `scheduler_pod_group_scheduling_duration_seconds`
> - `scheduler_pod_group_scheduling_algorithm_duration_seconds`
Metric names look good
@soltysh - FYI regarding #5558 (comment)
These match what we were talking about 4 months ago.
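For illustration, a minimal sketch of how the attempts counter could be consumed, assuming standard Prometheus scraping of the scheduler; only the metric name comes from the KEP, the rule name and window are illustrative:

```yaml
# Sketch of a Prometheus recording rule for the gang scheduling attempts counter.
groups:
  - name: gang-scheduling
    rules:
      # Per-second rate of workload (gang) scheduling attempts over 5 minutes.
      - record: scheduler:pod_group_scheduling_attempts:rate5m
        expr: rate(scheduler_pod_group_scheduling_attempts_total[5m])
```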
> 5. Create a Workload object named gang-test with minCount=2.
> 6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test.
> 7. The Pod stays in Pending state (waiting for the gang). We verified that
>    `scheduler_pod_group_scheduling_attempts_total` metric is incremented.
The metric should not be incremented, as pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.
Ok, is there any other metric / event that we can use here?
Can we just check "pending_pods" instead?
The pods are blocked, so will they be reported in the "gated" queue? Or "gated" is only for the ones with scheduling gates? @macsko
> The pods are blocked, so will they be reported in the "gated" queue?

Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.
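If we go that way, the check could look roughly like this; a sketch that relies on the existing `scheduler_pending_pods` metric and assumes gang-blocked pods are reported under `queue="gated"`, which is exactly the point being confirmed above:

```yaml
# Sketch: surface pods held before enqueue (including, per the discussion above,
# pods gated by the gang PreEnqueue plugin) via the scheduler's pending-pods metric.
groups:
  - name: gang-scheduling-gating
    rules:
      - record: scheduler:gated_pending_pods
        expr: scheduler_pending_pods{queue="gated"}
```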
> 11. Create test-pod-3 and test-pod-4 pointing to a workload.
> 12. The pods are scheduled immediately one-by-one (Workload logic is ignored/unavailable because the field is dropped).
> 13. Upgrade API Server and Scheduler back to v1.36.
> 14. Create new pods referencing a Workload; verifying that Gang Scheduling functionality is restored (pods wait for minCount before scheduling).
You mean that pod3 and pod4 won't be considered in calculating minCount, but only new pods will, is that correct?
What I had in mind is that we create a new workload with minCount=2 and pod3-4 pointing to it in step 11.
I don't want to consider cases where minCount < podCount, as it's semantically ambiguous.
However, now I realized we won't be able to create a new workload in step 11 because the API is disabled.
Updated the test scenario to fix that.
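For concreteness, a rough sketch of the manifests the test steps above refer to. Only `minCount` and the Pod's `spec.workloadRef` come from the KEP text; the API group/version and the rest of the Workload spec shape are assumptions:

```yaml
# Hypothetical shapes, for illustration only - the real alpha API may differ.
apiVersion: scheduling.k8s.io/v1alpha1   # assumed group/version
kind: Workload
metadata:
  name: gang-test
spec:
  minCount: 2            # the gang becomes schedulable once 2 pods reference it
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test      # ties this pod to the gang above
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```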
> ###### What steps should be taken if SLOs are not being met to determine the problem?
>
> 1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
>    `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling
Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.
Answered in the other comment, PTAL.
> Not required until feature graduated to beta.
> - Testing: Are there any tests for failure mode? If not, describe why.
> -->
> - Pods Pending Indefinitely (Gang Starvation)
There should be two main cases:
- Pods waiting in PreEnqueue until minCount is reached
- Pods cannot be scheduled because minCount pods do not fit
In both cases we should have the pod status set, informing about the reason.
Thanks, split into two cases.
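For illustration, the kind of status an operator would then see on such a pod; the reason and message strings below are assumptions, not the implemented values:

```yaml
# Hypothetical PodScheduled conditions for the two failure cases above.
status:
  phase: Pending
  conditions:
    # Case 1: the pod is held at PreEnqueue until the gang reaches minCount.
    - type: PodScheduled
      status: "False"
      reason: SchedulingGated                                  # assumed reason
      message: "waiting for Workload gang-test to reach minCount=2"
    # Case 2 (same condition type, shown commented out): the gang is complete,
    # but minCount pods do not fit in the cluster.
    # - type: PodScheduled
    #   status: "False"
    #   reason: Unschedulable
    #   message: "0/10 nodes available to place the whole pod group"
```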
> metrics:
>   - scheduler_pod_group_scheduling_attempts_total
>   - scheduler_pod_group_scheduling_duration_seconds
>   - scheduler_pod_group_scheduling_algorithm_duration_seconds
This metric should not bring any new information over the metric above.
Are we saying we'll not be adding it?
IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?
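To make that distinction usable in practice, a sketch of how the two histograms would be compared; the `_bucket` suffixes and the 5m window are assumptions:

```yaml
# Sketch: p99 of end-to-end vs. algorithm-only gang scheduling latency.
# A large gap between the two points at overheads outside the core algorithm
# (queueing, snapshotting, locking) rather than at algorithmic complexity.
groups:
  - name: gang-scheduling-latency
    rules:
      - record: scheduler:pod_group_scheduling_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(scheduler_pod_group_scheduling_duration_seconds_bucket[5m])) by (le))
      - record: scheduler:pod_group_scheduling_algorithm_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(scheduler_pod_group_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))
```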
mm4tt left a comment:
Thanks, PTAL
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
> alpha:
>   approver: "@soltysh"
> beta:
>   approver: "@soltysh"
Can't comment on the lines that didn't change, so adding some comments here:

> How can this feature be enabled / disabled in a live cluster?

As part of Beta, we will be adding two new feature gates (that may go directly to beta):
- WorkloadSchedulingCycle - to gate the logic related to it
- DelayedPreemption - to gate the delayed preemption logic

We should reflect that in the answer.

> Does enabling the feature change any default behavior?

Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?

> Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

This question needs adjustments to reflect the newly introduced gates - see above.

> Are there any tests for feature enablement/disablement?

Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
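To make the gate part of the answer concrete, enabling (or rolling back) the proposed gates on a live cluster would look roughly like this; a sketch only, since the gate names above are still a proposal:

```yaml
# Sketch of a kube-scheduler static pod fragment. WorkloadSchedulingCycle and
# DelayedPreemption are the gate names proposed in this thread, not existing gates;
# flipping them to =false is the corresponding rollback path.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.36.0   # illustrative version
      command:
        - kube-scheduler
        - --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
```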
> - Details:
>   - [x] API .spec
>     - Other field: workloadRef is set on the Pods.
>   - [x] Events
@macsko - are we going to have this event?
Anyway, I think this is misleading, because the lack of these events doesn't mean it's not working, so I would remove it.
> ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
> - Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
>   when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
How would you measure it using existing metrics?
[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]
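If it is indeed via binding calls, the measurement could be sketched like this, relying on the existing `apiserver_request_total` metric and assuming successful binds are a fair proxy for scheduling throughput:

```yaml
# Sketch: approximate cluster-wide scheduling throughput (pods/s) from the
# API server's request metrics, counting successful POSTs to pods/binding.
groups:
  - name: scheduling-throughput
    rules:
      - record: cluster:pod_binding_rate:rate5m
        expr: |
          sum(rate(apiserver_request_total{resource="pods",subresource="binding",verb="POST",code=~"2.."}[5m]))
```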