2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-scheduling/4671.yaml
@@ -1,3 +1,5 @@
kep-number: 4671
alpha:
  approver: "@soltysh"
beta:
  approver: "@soltysh"
Member comment:

Can't comment on the lines that didn't change, so adding some comments here:

> How can this feature be enabled / disabled in a live cluster?

As part of Beta, we will be adding two new feature gates (that may go directly to beta):

1. `WorkloadSchedulingCycle` - to gate the logic related to it
2. `DelayedPreemption` - to gate the delayed preemption logic

We should reflect that in the answer.

> Does enabling the feature change any default behavior?

Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?

> Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

This question needs adjustments to reflect the newly introduced gates - see above

> Are there any tests for feature enablement/disablement?

Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
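
To make the answer concrete, below is a minimal sketch of how an operator might enable the proposed gates on the scheduler. The gate names are taken from this comment and are not final, and the static-pod layout is illustrative only.

```yaml
# Illustrative only: a static-pod kube-scheduler manifest with the proposed
# gates enabled. Gate names come from the review discussion above and may
# change before beta.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.36.0
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
```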

172 changes: 104 additions & 68 deletions keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -766,30 +766,52 @@ This section must be completed when targeting beta to a release.

###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

Workloads that do not use the Workload API should not be impacted, since the functionality remains unchanged for them.
During a rolling upgrade, if the active scheduler instance has the feature disabled, it will schedule pods using the
standard pod-by-pod method. This results in a fallback to the status quo behavior, meaning that pods will still be
scheduled, but PodGroup-level scheduling constraints won't be applied.

The worst-case scenario is a critical bug in the new Workload Scheduling Cycle code causing a scheduler crash loop.
This would stop all scheduling but would not impact already running workloads, and rollback is a sufficient
mitigation.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
- `scheduler_schedule_attempts_total{result="error"}`: A sudden spike indicates internal errors or panics within
the scheduling loop, possibly caused by the new logic.
- `process_start_time_seconds`: Frequent resets of this metric indicate that the scheduler process is crashing and
restarting (crash loop).
- `scheduler_pod_scheduling_duration_seconds`: A significant regression in P99 latency for standard (non-gang) pods
would indicate that the overhead of the new logic is unacceptable.
- `scheduler_pod_group_scheduling_attempts_total` (new metric, TODO: check with Maciek on the metric name): Consistently
high failure rates for valid gangs compared to successful attempts.
- `scheduler_preemption_attempts_total`, `scheduler_preemption_victims`: A sudden increase might indicate that the
new "delayed preemption" logic is malfunctioning (e.g., triggering unnecessary preemptions).

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
We'll perform manual testing of the upgrade -> downgrade -> upgrade path using the following sequence:

1. Start a local Kubernetes v1.35 cluster with GenericWorkload and GangScheduling feature gates disabled (default
behavior).
2. Attempt to create a Pod with `spec.workloadRef` set.
3. The `spec.workloadRef` field is dropped by the API server. The pod is created successfully but without the workload
reference, resulting in immediate standard scheduling (one-by-one).
4. Restart/Upgrade API Server and Scheduler to v1.36 (with feature gates enabled).
5. Create two Workload objects: `gang-test-A` and `gang-test-B` (both with `minCount=2`); example manifests are
   sketched after this list.
6. Create a Pod `test-pod-1` with `spec.workloadRef` pointing to `gang-test-A`.
7. The Pod stays in `Pending` state (waiting for the gang). We verified that the
   `scheduler_pod_group_scheduling_attempts_total` metric is incremented.
Member comment:


The metric should not be incremented as pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.

Author comment:


Ok, is there any other metric / event that we can use here?

Member comment:


Can we just check "pending_pods" instead?

The pods are blocked, so will they be reported in the "gated" queue? Or is "gated" only for the ones with scheduling gates? @macsko

Member comment:


> The pods are blocked, so will they be reported in the "gated" queue?

Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.

8. Create a Pod `test-pod-2` pointing to the same workload.
9. Both pods are scheduled successfully in the same cycle (Gang Scheduling works).
10. Downgrade API Server and Scheduler to v1.35 (with feature gates disabled).
11. Create `test-pod-3` pointing to `gang-test-B`. Note: We use a workload created in step 5 because creating new
Workload objects is disabled.
12. The pod is scheduled immediately (Workload logic is ignored because the workloadRef field is dropped by
    the v1.35 API server). If Gang Scheduling were active, this pod would remain pending, waiting for a second member.
13. Upgrade API Server and Scheduler back to v1.36 (feature gates enabled).
14. Create `test-pod-4` and `test-pod-5` pointing to `gang-test-B`, verifying that Gang Scheduling functionality is
    restored (these pods wait for `minCount=2` before scheduling).
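
The objects referenced in steps 5-6 might look roughly like the sketch below (names lower-cased to satisfy Kubernetes naming rules). The Workload API group, kind, and field layout shown here are illustrative; the authoritative shapes are the ones defined in this KEP's API section.

```yaml
# Illustrative shapes only - the exact Workload schema and the location of
# workloadRef on the Pod are defined by the KEP and may differ.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: gang-test-a
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test-a   # the Workload defined above
    podGroup: workers
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
```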

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -806,11 +828,11 @@ previous answers based on experience in the field.

###### How can an operator determine if the feature is in use by workloads?

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
Operators can check the new `scheduler_pod_group_scheduling_attempts_total` metric. A value greater than zero
indicates that the scheduler is processing Workload Scheduling Cycles.

Alternatively, checking for the existence of `Workload` objects via `kubectl get workloads`, or checking whether the
`pod.spec.workloadRef` field is set on Pods, confirms that users are actively using the feature.

###### How can someone using this feature know that it is working for their instance?

@@ -823,50 +845,38 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

- [x] API .spec
  - Other field: workloadRef is set on the Pods.
- [x] Events
Member comment:


@macsko - are we going to have this event?

Anyway, I think this is misleading, because the lack of these events doesn't mean the feature isn't working, so I would remove it.

  - Event Type: Warning
  - Event Reason: FailedScheduling
  - Event Message: The message includes details if the scheduling failed due to gang constraints (e.g., "pod group
    minCount requirement not met").

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.

It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code

These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
Since there are no formal SLOs for the kube-scheduler apart from scalability SLOs, we define the objectives for this
feature primarily in terms of non-regression, ensuring that workload scheduling does not degrade the performance of
the standard scheduling loop.

- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
Member comment:


How would you measure it using existing metrics?

[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]


- Scheduling Latency: There should be no significant regression in pod scheduling latency
(`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline.
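
One way to measure the throughput objective with existing metrics, following the suggestion in the comment above, could be to count pod `binding` subresource requests on the API server. The recording rule below is a sketch, not a committed SLI definition.

```yaml
# Illustrative recording rule approximating scheduling throughput (pods/s)
# via API server binding calls.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-throughput-sli
spec:
  groups:
  - name: scheduler-sli
    rules:
    - record: cluster:scheduler_binding_rate:sum
      expr: sum(rate(apiserver_request_total{resource="pods",subresource="binding"}[5m]))
```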

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
  - `scheduler_pod_group_scheduling_attempts_total`
  - `scheduler_pod_group_scheduling_duration_seconds`
  - `scheduler_pod_group_scheduling_algorithm_duration_seconds`
Comment on lines +872 to +874 (Member):


Metric names look good

Member comment:


@soltysh - FYI regarding #5558 (comment)

These are matching what we were talking about 4 months ago.

  - Components exposing the metric: kube-scheduler

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
No.

### Dependencies

@@ -945,23 +955,49 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

The behavior is consistent with the status quo. Since the scheduler cannot bind pods or update statuses without the
API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will be
retried with standard exponential backoff once connectivity is restored.

###### What are other known failure modes?

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
- Pods Pending Indefinitely - Waiting for Gang Assembly (PreEnqueue)
- Detection:
- Check Pod Events/Status. Expected reason: a message indicating that the pod is waiting for more gang members.
- The number of pending pods belonging to the group is less than minCount.
- Mitigations:
- Ensure the controller created all required pods.
    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
- Diagnostics:
- Scheduler logs at V=4 searching for "workload" to trace the decision flow.
- Verify minCount in the Workload matches the number of pods created by the Job/Controller.
- Testing:
- Covered by integration tests submitting partial gangs.
- Pods Pending Indefinitely - Gang cannot fit (Resource Constraints)
- Detection: Check Pod Events/Status. Expected reason: a message indicating that minCount pods could not be
scheduled.
- Metrics: `scheduler_pod_group_scheduling_attempts_total` with result unschedulable.
- Mitigations:
    - Scale up the cluster (add nodes) or delete other workloads to free up capacity.
    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
- Diagnostics:
- Scheduler logs at V=4 searching for "workload" to see detailed reasons why the placement failed.
- Testing:
- Covered by integration tests submitting gangs larger than cluster capacity.
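
For the detection steps above, an operator could start from queries along the following lines. This is a sketch: the `queue="gated"` label value follows the review discussion earlier in this PR, and the pod-group metric name and its `result` label are still subject to change.

```yaml
# Illustrative queries for the two failure modes, wrapped in a PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gang-scheduling-failure-modes
spec:
  groups:
  - name: gang-scheduling
    rules:
    # Pods waiting for gang assembly are blocked at PreEnqueue ("gated" queue).
    - alert: GangPodsGatedTooLong
      expr: scheduler_pending_pods{queue="gated"} > 0
      for: 30m
    # Gangs repeatedly failing to fit show up as unschedulable attempts.
    - alert: GangUnschedulable
      expr: rate(scheduler_pod_group_scheduling_attempts_total{result="unschedulable"}[15m]) > 0
      for: 30m
```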

###### What steps should be taken if SLOs are not being met to determine the problem?

1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
   `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling
   Cycle logic itself is computationally expensive and causing the regression.

Member comment:

Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.

Author comment:

Answered in the other comment, PTAL.

2. Inspect Logs: Enable scheduler logging at V=4 to trace the execution time of individual Workload Scheduling
Cycles and identify if specific large gangs are blocking the queue.
3. Disable Feature: If the regression is critical and impacting cluster health, disable the GangScheduling feature
gate. This will revert the scheduler to the standard pod-by-pod logic, restoring baseline performance (at the
cost of losing gang semantics).

## Implementation History

<!--
8 changes: 6 additions & 2 deletions keps/sig-scheduling/4671-gang-scheduling/kep.yaml
@@ -8,6 +8,7 @@ authors:
- "@44past4"
- "@andreyvelich"
- "@thockin"
- "@mm4tt"

owning-sig: sig-scheduling
participating-sigs:
@@ -27,12 +28,12 @@ replaces:
# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.35"
latest-milestone: "v1.36"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
@@ -54,3 +55,6 @@ disable-supported: true

# The following PRR answers are required at beta release
metrics:
- scheduler_pod_group_scheduling_attempts_total
- scheduler_pod_group_scheduling_duration_seconds
- scheduler_pod_group_scheduling_algorithm_duration_seconds
Member comment:


This metric should not bring any new information over the metric above.

Author comment:


Are we saying we'll not be adding it?

Author comment:


IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?