[WIP] KEP-4671: Beta Promotion #5854
Conversation
mm4tt commented on Jan 30, 2026:
- One-line PR description: Promoting KEP-4671 Gang Scheduling to Beta in 1.36
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments:
Skipping CI for Draft Pull Request.
> specific group.
> - Events: Repeated `FailedScheduling` events on the Pods with workloadRef.
> - Mitigations: If the gang cannot fit due to resource constraints, delete the Workload object which should disable
>   the gang-scheduling TODO(mm4tt@): Discuss with Wojtek
@wojtek-t this is related to the opt-out discussion we need to have. PTAL and let me know your thoughts
> - `scheduler_pod_group_scheduling_attempts_total`
> - `scheduler_pod_group_scheduling_duration_seconds`
> - `scheduler_pod_group_scheduling_algorithm_duration_seconds`
Metric names look good
@soltysh - FYI regarding #5558 (comment)
These match what we were talking about 4 months ago.
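For illustration, a minimal sketch of how the attempts counter could be consumed, assuming standard Prometheus scraping of the scheduler; only the metric name comes from the KEP, the rule name and window are illustrative:

```yaml
# Sketch of a Prometheus recording rule for the gang scheduling attempts counter.
groups:
  - name: gang-scheduling
    rules:
      # Per-second rate of workload (gang) scheduling attempts over 5 minutes.
      - record: scheduler:pod_group_scheduling_attempts:rate5m
        expr: rate(scheduler_pod_group_scheduling_attempts_total[5m])
```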
> 5. Create a Workload object named gang-test with minCount=2.
> 6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test.
> 7. The Pod stays in Pending state (waiting for the gang). We verified that
>    `scheduler_pod_group_scheduling_attempts_total` metric is incremented.
The metric should not be incremented, as pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.
Ok, is there any other metric / event that we can use here?
Can we just check "pending_pods" instead?
The pods are blocked, so will they be reported in the "gated" queue? Or "gated" is only for the ones with scheduling gates? @macsko
> The pods are blocked, so will they be reported in the "gated" queue?

Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.
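If we go that way, the check could look roughly like this; a sketch that relies on the existing `scheduler_pending_pods` metric and assumes gang-blocked pods are reported under `queue="gated"`, which is exactly the point being confirmed above:

```yaml
# Sketch: surface pods held before enqueue (including, per the discussion above,
# pods gated by the gang PreEnqueue plugin) via the scheduler's pending-pods metric.
groups:
  - name: gang-scheduling-gating
    rules:
      - record: scheduler:gated_pending_pods
        expr: scheduler_pending_pods{queue="gated"}
```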
> 11. Create test-pod-3 and test-pod-4 pointing to a workload.
> 12. The pods are scheduled immediately one-by-one (Workload logic is ignored/unavailable because the field is dropped).
> 13. Upgrade API Server and Scheduler back to v1.36.
> 14. Create new pods referencing a Workload; verifying that Gang Scheduling functionality is restored (pods wait for minCount before scheduling).
You mean that pod3 and pod4 won't be considered in calculating minCount, but only new pods will, is that correct?
What I had in mind is that we create a new workload with minCount=2 and pod3-4 pointing to it in step 11.
I don't want to consider cases where minCount < podCount, as it's semantically ambiguous.
However, now I realized we won't be able to create a new workload in step 11 because the API is disabled.
Updated the test scenario to fix that.
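For concreteness, a rough sketch of the manifests the test steps above refer to. Only `minCount` and the Pod's `spec.workloadRef` come from the KEP text; the API group/version and the rest of the Workload spec shape are assumptions:

```yaml
# Hypothetical shapes, for illustration only - the real alpha API may differ.
apiVersion: scheduling.k8s.io/v1alpha1   # assumed group/version
kind: Workload
metadata:
  name: gang-test
spec:
  minCount: 2            # the gang becomes schedulable once 2 pods reference it
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test      # ties this pod to the gang above
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```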
> ###### What steps should be taken if SLOs are not being met to determine the problem?
>
> 1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
>    `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling
Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.
Answered in the other comment, PTAL.
> Not required until feature graduated to beta.
> - Testing: Are there any tests for failure mode? If not, describe why.
> -->
> - Pods Pending Indefinitely (Gang Starvation)
There should be two main cases:
- Pods waiting in PreEnqueue until minCount is reached
- Pods cannot be scheduled because minCount pods do not fit
In both cases we should have the pod status set, informing about the reason.
Thanks, split into two cases.
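For illustration, the kind of status an operator would then see on such a pod; the reason and message strings below are assumptions, not the implemented values:

```yaml
# Hypothetical PodScheduled conditions for the two failure cases above.
status:
  phase: Pending
  conditions:
    # Case 1: the pod is held at PreEnqueue until the gang reaches minCount.
    - type: PodScheduled
      status: "False"
      reason: SchedulingGated                                  # assumed reason
      message: "waiting for Workload gang-test to reach minCount=2"
    # Case 2 (same condition type, shown commented out): the gang is complete,
    # but minCount pods do not fit in the cluster.
    # - type: PodScheduled
    #   status: "False"
    #   reason: Unschedulable
    #   message: "0/10 nodes available to place the whole pod group"
```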
> metrics:
>   - scheduler_pod_group_scheduling_attempts_total
>   - scheduler_pod_group_scheduling_duration_seconds
>   - scheduler_pod_group_scheduling_algorithm_duration_seconds
This metric should not bring any new information over the metric above.
Are we saying we'll not be adding it?
IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?
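To make that distinction usable in practice, a sketch of how the two histograms would be compared; the `_bucket` suffixes and the 5m window are assumptions:

```yaml
# Sketch: p99 of end-to-end vs. algorithm-only gang scheduling latency.
# A large gap between the two points at overheads outside the core algorithm
# (queueing, snapshotting, locking) rather than at algorithmic complexity.
groups:
  - name: gang-scheduling-latency
    rules:
      - record: scheduler:pod_group_scheduling_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(scheduler_pod_group_scheduling_duration_seconds_bucket[5m])) by (le))
      - record: scheduler:pod_group_scheduling_algorithm_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(scheduler_pod_group_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))
```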
mm4tt left a comment:
Thanks, PTAL
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
> alpha:
>   approver: "@soltysh"
> beta:
>   approver: "@soltysh"
Can't comment on the lines that didn't change, so adding some comments here:

> How can this feature be enabled / disabled in a live cluster?

As part of Beta, we will be adding two new feature gates (that may go directly to beta):
- WorkloadSchedulingCycle - to gate the logic related to it
- DelayedPreemption - to gate the delayed preemption logic

We should reflect that in the answer.

> Does enabling the feature change any default behavior?

Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?

> Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

This question needs adjustments to reflect the newly introduced gates - see above.

> Are there any tests for feature enablement/disablement?

Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
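To make the gate part of the answer concrete, enabling (or rolling back) the proposed gates on a live cluster would look roughly like this; a sketch only, since the gate names above are still a proposal:

```yaml
# Sketch of a kube-scheduler static pod fragment. WorkloadSchedulingCycle and
# DelayedPreemption are the gate names proposed in this thread, not existing gates;
# flipping them to =false is the corresponding rollback path.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.36.0   # illustrative version
      command:
        - kube-scheduler
        - --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
```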
> - Details:
>   - [x] API .spec
>     - Other field: workloadRef is set on the Pods.
>   - [x] Events
@macsko - are we going to have this event?
Anyway, I think this is misleading, because the lack of these events doesn't mean it's not working, so I would remove it.
> ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
> - Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
>   when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
How would you measure it using existing metrics?
[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]
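If it is indeed via binding calls, the measurement could be sketched like this, relying on the existing `apiserver_request_total` metric and assuming successful binds are a fair proxy for scheduling throughput:

```yaml
# Sketch: approximate cluster-wide scheduling throughput (pods/s) from the
# API server's request metrics, counting successful POSTs to pods/binding.
groups:
  - name: scheduling-throughput
    rules:
      - record: cluster:pod_binding_rate:rate5m
        expr: |
          sum(rate(apiserver_request_total{resource="pods",subresource="binding",verb="POST",code=~"2.."}[5m]))
```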