Conversation

@mm4tt

@mm4tt mm4tt commented Jan 30, 2026

  • One-line PR description: Promoting KEP-4671 Gang Scheduling to Beta in 1.36
  • Other comments:

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 30, 2026
@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko January 30, 2026 07:31
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 30, 2026
@mm4tt mm4tt force-pushed the kep-4671-beta-promotion branch 3 times, most recently from f25e37a to 25677bc Compare January 30, 2026 07:40
@mm4tt mm4tt force-pushed the kep-4671-beta-promotion branch 4 times, most recently from 8e713ff to 4f72681 Compare January 30, 2026 14:15
@mm4tt mm4tt force-pushed the kep-4671-beta-promotion branch from 4f72681 to d4bef89 Compare January 30, 2026 14:16
specific group.
- Events: Repeated `FailedScheduling` events on the Pods with workloadRef.
- Mitigations: If the gang cannot fit due to resource constraints, delete the Workload object, which should disable
gang scheduling. TODO(mm4tt@): Discuss with Wojtek
Author

@wojtek-t this is related to the opt-out discussion we need to have. PTAL and let me know your thoughts

Comment on lines +867 to +869
- `scheduler_pod_group_scheduling_attempts_total`
- `scheduler_pod_group_scheduling_duration_seconds`
- `scheduler_pod_group_scheduling_algorithm_duration_seconds`
Member

Metric names look good

Member

@soltysh - FYI regarding #5558 (comment)

These are matching what we were talking about 4 months ago.

5. Create a Workload object named gang-test with minCount=2.
6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test.
7. The Pod stays in the Pending state (waiting for the gang). Verify that the
`scheduler_pod_group_scheduling_attempts_total` metric is incremented.
Member

The metric should not be incremented, as the pods will be blocked at PreEnqueue, so the workload scheduling cycle should not be triggered yet.

Author

Ok, is there any other metric / event that we can use here?

Member

Can we just check "pending_pods" instead?

The pods are blocked, so will they be reported in the "gated" queue? Or is "gated" only for the ones with scheduling gates? @macsko

Member

The pods are blocked, so will they be reported in the "gated" queue?

Right, the pods blocked at PreEnqueue will be counted as "gated" in the "pending_pods" metric.
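
The behavior described in this thread — gang pods held back at PreEnqueue until minCount is reached, and surfacing under the "gated" queue of the pending-pods metric — can be sketched roughly as follows. All names here are illustrative; the real plugin lives in the kube-scheduler's Go codebase:

```python
# Minimal sketch (not the real kube-scheduler code) of a PreEnqueue-style
# gate that holds gang members back until minCount pods exist, and of how
# held-back pods would show up as "gated" pending pods.
from collections import defaultdict


class GangGate:
    def __init__(self):
        self.members = defaultdict(set)   # workload name -> pod names seen
        self.min_counts = {}              # workload name -> required minCount

    def register_workload(self, name, min_count):
        self.min_counts[name] = min_count

    def pre_enqueue(self, pod_name, workload_ref):
        """Return True if the pod may enter the active scheduling queue."""
        self.members[workload_ref].add(pod_name)
        return len(self.members[workload_ref]) >= self.min_counts[workload_ref]

    def gated_count(self, workload_ref):
        """Pods held back, i.e. what a pending_pods "gated" queue would report."""
        if len(self.members[workload_ref]) >= self.min_counts[workload_ref]:
            return 0
        return len(self.members[workload_ref])


gate = GangGate()
gate.register_workload("gang-test", min_count=2)
print(gate.pre_enqueue("test-pod-1", "gang-test"))  # False: waiting for the gang
print(gate.gated_count("gang-test"))                # 1 pod counted as gated
print(gate.pre_enqueue("test-pod-2", "gang-test"))  # True: minCount reached
```

Note that no group-scheduling attempt happens while `pre_enqueue` returns False, which is why an attempts counter would stay flat at this step.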

11. Create test-pod-3 and test-pod-4 pointing to a workload.
12. The pods are scheduled immediately one-by-one (Workload logic is ignored/unavailable because the field is dropped).
13. Upgrade API Server and Scheduler back to v1.36.
14. Create new pods referencing a Workload; verify that Gang Scheduling functionality is restored (pods wait for minCount before scheduling).
Member

You mean that pod3 and pod4 won't be considered in calculating minCount, but only new pods will, is that correct?

Author

What I had in mind is that we create a new workload with minCount=2 and pod3-4 pointing to it in step 11.
I don't want to consider cases where minCount < podCount, as it's semantically ambiguous.

However, now I realized we won't be able to create a new workload in step 11 because the API is disabled.

Updated the test scenario to fix that.

###### What steps should be taken if SLOs are not being met to determine the problem?

1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
`scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling
Member

Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS or WAS preemption yet.

Author

Answered in the other comment, PTAL.

Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
- Pods Pending Indefinitely (Gang Starvation)
Member

There should be two main cases:

  1. Pods waiting in PreEnqueue until minCount is reached
  2. Pods that cannot be scheduled because minCount pods do not fit

In both cases the pod status should be set, informing about the reason.
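
A rough sketch of the two cases (the reason strings here are invented for illustration; the real status messages are whatever the implementation defines):

```python
# Illustrative only: classify why a gang pod is pending, mirroring the two
# failure cases above. Reason strings are made up, not actual API values.
def pending_reason(pods_created, min_count, schedulable_capacity):
    if pods_created < min_count:
        # Case 1: held at PreEnqueue until the gang is complete.
        return "WaitingForGang: %d/%d pods created" % (pods_created, min_count)
    if schedulable_capacity < min_count:
        # Case 2: gang is complete but minCount pods do not fit.
        return "GangUnschedulable: capacity for %d < minCount %d" % (
            schedulable_capacity, min_count)
    return None  # schedulable


print(pending_reason(1, 2, 10))  # WaitingForGang: 1/2 pods created
print(pending_reason(2, 2, 1))   # GangUnschedulable: capacity for 1 < minCount 2
print(pending_reason(2, 2, 5))   # None
```

The point of separating the two is that they call for different mitigations: case 1 resolves itself once the remaining pods are created, while case 2 needs capacity or workload changes.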

Author

Thanks, split into two cases.

metrics:
- scheduler_pod_group_scheduling_attempts_total
- scheduler_pod_group_scheduling_duration_seconds
- scheduler_pod_group_scheduling_algorithm_duration_seconds
Member

This metric should not bring any new information over the metric above.

Author

Are we saying we'll not be adding it?

Author

IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?
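
The split being argued for can be illustrated with a toy sketch (not the scheduler's actual instrumentation; `observe` stands in for a Prometheus histogram observation, and the sort stands in for the core placement algorithm):

```python
# Toy illustration: the end-to-end timer wraps the whole scheduling cycle,
# while the algorithm timer covers only the core placement computation.
import time

durations = {"e2e": [], "algorithm": []}


def observe(name, seconds):
    # Stand-in for a Prometheus histogram observation.
    durations[name].append(seconds)


def schedule_pod_group(group):
    cycle_start = time.perf_counter()
    # ... queue operations, snapshotting, etc. would happen here ...
    algo_start = time.perf_counter()
    placement = sorted(group)  # stand-in for the core algorithm
    observe("algorithm", time.perf_counter() - algo_start)
    # ... binding, bookkeeping, queue unlocking, etc. would happen here ...
    observe("e2e", time.perf_counter() - cycle_start)
    return placement


schedule_pod_group(["pod-b", "pod-a"])
# The gap between the two observations is the system overhead the
# comment refers to (snapshotting, queue locking, ...).
assert durations["e2e"][0] >= durations["algorithm"][0]
```

With both series recorded, a regression in `e2e` without a matching regression in `algorithm` points at overhead rather than the algorithm itself.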

@mm4tt mm4tt left a comment

Thanks, PTAL

@mm4tt mm4tt marked this pull request as ready for review February 2, 2026 15:22
@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko February 2, 2026 15:22
@mm4tt
Copy link
Author

mm4tt commented Feb 2, 2026

/assign @soltysh
/cc @soltysh

Hi @soltysh , could you please take a look at the PRR section for Beta promotion?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
Once this PR has been reviewed and has the lgtm label, please ask for approval from soltysh and additionally assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alpha:
approver: "@soltysh"
beta:
approver: "@soltysh"
Member

Can't comment on the lines that didn't change, so adding some comments here:

How can this feature be enabled / disabled in a live cluster?

As part of Beta, we will be adding two new feature gates (that may go directly to beta):

  1. WorkloadSchedulingCycle - to gate the logic related to it
  2. DelayedPreemption - to gate delayed preemption logic
    We should reflect that in the answer

Does enabling the feature change any default behavior?

Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

This question needs adjustments to reflect the newly introduced gates - see above

Are there any tests for feature enablement/disablement?

Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]

5. Create a Workload object named gang-test with minCount=2.
6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test.
7. The Pod stays in the Pending state (waiting for the gang). Verify that the
`scheduler_pod_group_scheduling_attempts_total` metric is incremented.
Member

Can we just check "pending_pods" instead?

The pods are blocked, so will they be reported in the "gated" queue? Or "gated" is only for the ones with scheduling gates? @macsko

- Details:
- [x] API .spec
- Other field: workloadRef is set on the Pods.
- [x] Events
Member

@macsko - are we going to have this event?

Anyway, I think this is misleading, because the lack of these events doesn't mean the feature isn't working, so I would remove it.


###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
Member

How would you measure it using existing metrics?

[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]
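
A sketch of the arithmetic behind that idea, assuming a cumulative counter of successful "/binding" calls can be sampled over a window (the function and sample values are illustrative; in practice this is what a rate()-style query computes):

```python
# Toy throughput computation over counter samples: the same arithmetic a
# rate()-style query performs on a cumulative "/binding" call counter.
def throughput_pods_per_sec(samples):
    """samples: list of (timestamp_sec, cumulative_binding_count) tuples."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0)


# e.g. 300 bindings observed over a 60 s window -> 5 pods/s
print(throughput_pods_per_sec([(0, 1000), (30, 1150), (60, 1300)]))  # 5.0
```

Comparing this rate for a Workload-attached run against a baseline run with an equivalent number of individual pods would give the regression signal the SLI asks for.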

@wojtek-t wojtek-t self-assigned this Feb 3, 2026