[WIP] KEP-4671: Beta Promotion #5854
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,5 @@ | ||
| kep-number: 4671 | ||
| alpha: | ||
| approver: "@soltysh" | ||
| beta: | ||
| approver: "@soltysh" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -766,30 +766,52 @@ This section must be completed when targeting beta to a release. | |
|
|
||
| ###### How can a rollout or rollback fail? Can it impact already running workloads? | ||
|
|
||
| <!-- | ||
| Try to be as paranoid as possible - e.g., what if some components will restart | ||
| mid-rollout? | ||
| Workloads that do not use the Workload API should not be impacted, since the functionality remains unchanged for them. | ||
| During a rolling upgrade, if the active scheduler instance has the feature disabled, it will schedule pods using the | ||
| standard pod-by-pod method. This falls back to the status quo behavior: pods will still be | ||
| scheduled, but PodGroup-level scheduling constraints won't be applied. | ||
|
|
||
| Be sure to consider highly-available clusters, where, for example, | ||
| feature flags will be enabled on some API servers and not others during the | ||
| rollout. Similarly, consider large clusters and how enablement/disablement | ||
| will rollout across nodes. | ||
| --> | ||
| The worst-case scenario is a critical bug in the new Workload Scheduling Cycle code causing the scheduler to crash-loop. | ||
| This would stop all scheduling but would not impact already-running workloads, and rolling back is a sufficient | ||
| mitigation. | ||
|
|
||
| ###### What specific metrics should inform a rollback? | ||
|
|
||
| <!-- | ||
| What signals should users be paying attention to when the feature is young | ||
| that might indicate a serious problem? | ||
| --> | ||
| - `scheduler_schedule_attempts_total{result="error"}`: A sudden spike indicates internal errors or panics within | ||
| the scheduling loop, possibly caused by the new logic. | ||
| - `process_start_time_seconds`: Frequent resets of this metric indicate that the scheduler process is crashing and | ||
| restarting (crash loop). | ||
| - `scheduler_pod_scheduling_duration_seconds`: A significant regression in P99 latency for standard (non-gang) pods | ||
| would indicate that the overhead of the new logic is unacceptable. | ||
| - `scheduler_pod_group_scheduling_attempts_total` (new metric, TODO: check with Maciek on the metric name): Consistently | ||
| high failure rates for valid gangs compared to successful attempts. | ||
| - `scheduler_preemption_attempts_total`, `scheduler_preemption_victims`: A sudden increase might indicate that the | ||
| new "delayed preemption" logic is malfunctioning (e.g., triggering unnecessary preemptions). | ||
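As an illustration only (not part of the KEP), the signals above could be turned into alert rules roughly as follows. This sketch assumes the Prometheus Operator `PrometheusRule` CRD, a `job="kube-scheduler"` scrape label, and placeholder names and thresholds:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gang-scheduling-rollback-signals   # hypothetical name
  namespace: monitoring                    # assumes a "monitoring" namespace
spec:
  groups:
    - name: kube-scheduler.gang-scheduling
      rules:
        - alert: SchedulerScheduleErrorsSpiking
          # Sudden spike of internal errors/panics in the scheduling loop.
          expr: sum(rate(scheduler_schedule_attempts_total{result="error"}[5m])) > 0.1
          for: 10m
        - alert: SchedulerCrashLooping
          # process_start_time_seconds resets whenever the scheduler process restarts.
          expr: changes(process_start_time_seconds{job="kube-scheduler"}[30m]) > 3
        - alert: PodSchedulingLatencyRegression
          # P99 pod scheduling latency; the threshold should be set from the pre-rollout baseline.
          expr: |
            histogram_quantile(0.99,
              sum(rate(scheduler_pod_scheduling_duration_seconds_bucket[10m])) by (le)) > 5
```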
|
|
||
| ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||
| We'll perform manual testing of the upgrade -> downgrade -> upgrade path using the following sequence (illustrative manifests for the test objects are sketched after the list): | ||
|
|
||
| 1. Start a local Kubernetes v1.35 cluster with GenericWorkload and GangScheduling feature gates disabled (default | ||
| behavior). | ||
| 2. Attempt to create a Pod with `spec.workloadRef` set. | ||
| 3. The `spec.workloadRef` field is dropped by the API server. The pod is created successfully but without the workload | ||
| reference, resulting in immediate standard scheduling (one-by-one). | ||
| 4. Restart/Upgrade API Server and Scheduler to v1.36 (with feature gates enabled). | ||
| 5. Create two Workload objects: `gang-test-A` and `gang-test-B` (both with `minCount=2`). | ||
| 6. Create a Pod `test-pod-1` with `spec.workloadRef` pointing to `gang-test-A`. | ||
| 7. The Pod stays in the `Pending` state (waiting for the gang). Verify that the pod is reported as "gated" in the | ||
| `pending_pods` metric, since pods blocked in PreEnqueue are counted there. | ||
|
Member
The metric should not be incremented, as the pods will be blocked in PreEnqueue, so the workload scheduling cycle should not be triggered yet.
Author
OK, is there any other metric or event that we can use here?
Member
Can we just check "pending_pods" instead? The pods are blocked, so will they be reported in the "gated" queue? Or is "gated" only for the ones with scheduling gates? @macsko
Member
Right, the pods blocked in PreEnqueue will be counted as "gated" in the "pending_pods" metric.
||
| 8. Create a Pod `test-pod-2` pointing to the same workload. | ||
| 9. Both pods are scheduled successfully in the same cycle (Gang Scheduling works). | ||
| 10. Downgrade API Server and Scheduler to v1.35 (with feature gates disabled). | ||
| 11. Create `test-pod-3` pointing to `gang-test-B`. Note: we reuse a Workload created in step 5 because the Workload | ||
| API is disabled in v1.35, so new Workload objects cannot be created. | ||
| 12. The pod is scheduled immediately (Workload logic is ignored because the workloadRef field is dropped by | ||
| the v1.35 API server). If Gang Scheduling were active, this pod would remain Pending, waiting for a second member. | ||
| 13. Upgrade API Server and Scheduler back to v1.36 (feature gates enabled). | ||
| 14. Create `test-pod-4` and `test-pod-5` pointing to `gang-test-B`, and verify that Gang Scheduling functionality is | ||
| restored (these pods wait for `minCount=2` before being scheduled). | ||
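For reference, here is a minimal sketch of the objects used in the steps above. The Workload API is still alpha, so the group/version and field layout below (`scheduling.k8s.io/v1alpha1`, `podGroups`, the `podGroup` field inside `workloadRef`) are assumptions for illustration; only `minCount` and `pod.spec.workloadRef` are taken from the KEP text:

```yaml
# Hypothetical shape of the Workload created in step 5 (schema is illustrative).
apiVersion: scheduling.k8s.io/v1alpha1   # assumed group/version
kind: Workload
metadata:
  name: gang-test-a                      # lowercased; object names must be DNS-1123 compliant
spec:
  podGroups:
    - name: workers
      policy:
        gang:
          minCount: 2                    # the gang size the test waits for
---
# Pod created in step 6; spec.workloadRef ties it to the gang above.
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test-a
    podGroup: workers                    # assumed field naming the pod group
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```

With the feature gates disabled (steps 2-3 and 10-12), the API server drops `spec.workloadRef` on admission, which is why those pods fall back to ordinary pod-by-pod scheduling.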
|
|
||
| <!-- | ||
| Describe manual testing that was done and the outcomes. | ||
| Longer term, we may want to require automated upgrade/rollback tests, but we | ||
| are missing a bunch of machinery and tooling and can't do that now. | ||
| --> | ||
|
|
||
| ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||
|
|
||
|
|
@@ -806,11 +828,11 @@ previous answers based on experience in the field. | |
|
|
||
| ###### How can an operator determine if the feature is in use by workloads? | ||
|
|
||
| <!-- | ||
| Ideally, this should be a metric. Operations against the Kubernetes API (e.g., | ||
| checking if there are objects with field X set) may be a last resort. Avoid | ||
| logs or events for this purpose. | ||
| --> | ||
| Operators can check the new `scheduler_pod_group_scheduling_attempts_total` metric. A value greater than zero | ||
| indicates that the scheduler is processing Workload Scheduling Cycles. | ||
|
|
||
| Alternatively, checking for the existence of `Workload` objects via `kubectl get workloads`, or checking whether | ||
| `pod.spec.workloadRef` is set on any pods, confirms that users are actively using the feature. | ||
|
|
||
| ###### How can someone using this feature know that it is working for their instance? | ||
|
|
||
|
|
@@ -823,50 +845,38 @@ and operation of this feature. | |
| Recall that end users cannot usually observe component logs or access metrics. | ||
| --> | ||
|
|
||
| - [ ] Events | ||
| - Event Reason: | ||
| - [ ] API .status | ||
| - Condition name: | ||
| - Other field: | ||
| - [ ] Other (treat as last resort) | ||
| - Details: | ||
| - [x] API .spec | ||
| - Other field: workloadRef is set on the Pods. | ||
| - [x] Events | ||
|
Member
@macsko - are we going to have this event? In any case, I think this is misleading, because the lack of these events doesn't mean the feature isn't working, so I would remove it anyway.
||
| - Event Type: Warning | ||
| - Event Reason: FailedScheduling | ||
| - Event Message: The message includes details if the scheduling failed due to gang constraints (e.g., "pod group | ||
| minCount requirement not met"). | ||
|
|
||
| ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? | ||
|
|
||
| <!-- | ||
| This is your opportunity to define what "normal" quality of service looks like | ||
| for a feature. | ||
|
|
||
| It's impossible to provide comprehensive guidance, but at the very | ||
| high level (needs more precise definitions) those may be things like: | ||
| - per-day percentage of API calls finishing with 5XX errors <= 1% | ||
| - 99% percentile over day of absolute value from (job creation time minus expected | ||
| job creation time) for cron job <= 10% | ||
| - 99.9% of /health requests per day finish with 200 code | ||
|
|
||
| These goals will help you determine what you need to measure (SLIs) in the next | ||
| question. | ||
| --> | ||
| Since there are no formal SLOs for the kube-scheduler apart from the scalability SLOs, we define the objectives for | ||
| this feature primarily in terms of non-regression: workload scheduling must not degrade the performance of the | ||
| standard scheduling loop. | ||
|
|
||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
| - Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s) | ||
| when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods. | ||
|
Member
How would you measure it using existing metrics? (I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.)
||
|
|
||
| <!-- | ||
| Pick one more of these and delete the rest. | ||
| --> | ||
| - Scheduling Latency: There should be no significant regression in pod scheduling latency | ||
| (`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline. | ||
|
|
||
| - [ ] Metrics | ||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
|
|
||
| - [x] Metrics | ||
| - Metric name: | ||
| - [Optional] Aggregation method: | ||
| - Components exposing the metric: | ||
| - [ ] Other (treat as last resort) | ||
| - Details: | ||
| - `scheduler_pod_group_scheduling_attempts_total` | ||
| - `scheduler_pod_group_scheduling_duration_seconds` | ||
| - `scheduler_pod_group_scheduling_algorithm_duration_seconds` | ||
|
Comment on lines +872 to +874
Member
Metric names look good.
Member
@soltysh - FYI regarding #5558 (comment). These match what we were talking about 4 months ago.
||
| - Components exposing the metric: kube-scheduler | ||
|
|
||
| ###### Are there any missing metrics that would be useful to have to improve observability of this feature? | ||
|
|
||
| <!-- | ||
| Describe the metrics themselves and the reasons why they weren't added (e.g., cost, | ||
| implementation difficulties, etc.). | ||
| --> | ||
| No. | ||
|
|
||
| ### Dependencies | ||
|
|
||
|
|
@@ -945,23 +955,49 @@ details). For now, we leave it here. | |
|
|
||
| ###### How does this feature react if the API server and/or etcd is unavailable? | ||
|
|
||
| The behavior is consistent with the status quo. Since the scheduler cannot bind pods or update statuses without the | ||
| API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will be | ||
| retried with standard exponential backoff once connectivity is restored. | ||
|
|
||
| ###### What are other known failure modes? | ||
|
|
||
| <!-- | ||
| For each of them, fill in the following information by copying the below template: | ||
| - [Failure mode brief description] | ||
| - Detection: How can it be detected via metrics? Stated another way: | ||
| how can an operator troubleshoot without logging into a master or worker node? | ||
| - Mitigations: What can be done to stop the bleeding, especially for already | ||
| running user workloads? | ||
| - Diagnostics: What are the useful log messages and their required logging | ||
| levels that could help debug the issue? | ||
| Not required until feature graduated to beta. | ||
| - Testing: Are there any tests for failure mode? If not, describe why. | ||
| --> | ||
| - Pods Pending Indefinitely - Waiting for Gang Assembly (PreEnqueue) | ||
| - Detection: | ||
| - Check Pod Events/Status. Expected reason: a message indicating that the pod is waiting for more gang members. | ||
| - The number of pending pods belonging to the group is less than minCount. | ||
| - Mitigations: | ||
| - Ensure the controller created all required pods. | ||
| - If acceptable, delete the Workload object to disable gang scheduling for these pods (falling back to | ||
| best-effort, pod-by-pod scheduling). | ||
| - Diagnostics: | ||
| - Scheduler logs at V=4 searching for "workload" to trace the decision flow. | ||
| - Verify minCount in the Workload matches the number of pods created by the Job/Controller. | ||
| - Testing: | ||
| - Covered by integration tests submitting partial gangs. | ||
| - Pods Pending Indefinitely - Gang cannot fit (Resource Constraints) | ||
| - Detection: Check Pod Events/Status. Expected reason: a message indicating that minCount pods could not be | ||
| scheduled. | ||
| - Metrics: `scheduler_pod_group_scheduling_attempts_total` with the `unschedulable` result. | ||
| - Mitigations: | ||
| - Scale up the cluster (add nodes) or delete other workloads to free up space. | ||
| - If acceptable, delete the Workload object to disable gang scheduling for these pods (falling back to | ||
| best-effort, pod-by-pod scheduling). | ||
| - Diagnostics: | ||
| - Scheduler logs at V=4 searching for "workload" to see detailed reasons why the placement failed. | ||
| - Testing: | ||
| - Covered by integration tests submitting gangs larger than cluster capacity. | ||
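Both failure modes surface to users primarily as FailedScheduling events on the pending pods. A rough sketch of such an event is shown below; the field values and message wording are illustrative, not a finalized format:

```yaml
apiVersion: v1
kind: Event
metadata:
  name: test-pod-1.180f2a7c9e1b4d00   # example auto-generated name
  namespace: default
type: Warning
reason: FailedScheduling
involvedObject:
  kind: Pod
  name: test-pod-1
  namespace: default
source:
  component: default-scheduler
message: 'pod group minCount requirement not met: waiting for 2 pods in gang-test-a, observed 1'
```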
|
|
||
| ###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
|
||
| 1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and | ||
| `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling | ||
|
Member
Not sure whether scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with "algorithm") brings anything over the one without it when there is no TAS nor WAS preemption yet.
Author
Answered in the other comment, PTAL.
||
| Cycle logic itself is computationally expensive and causing the regression. | ||
| 2. Inspect Logs: Enable scheduler logging at V=4 to trace the execution time of individual Workload Scheduling | ||
| Cycles and identify if specific large gangs are blocking the queue. | ||
| 3. Disable Feature: If the regression is critical and impacting cluster health, disable the GangScheduling feature | ||
| gate. This will revert the scheduler to the standard pod-by-pod logic, restoring baseline performance (at the | ||
| cost of losing gang semantics). | ||
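As a sketch of step 3: on a kubeadm-style control plane (an assumption; other installers configure the scheduler differently), the gate can be turned off by editing the kube-scheduler static Pod manifest, roughly as follows:

```yaml
# /etc/kubernetes/manifests/kube-scheduler.yaml (kubeadm layout assumed; trimmed for brevity)
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.36.0   # placeholder version
      command:
        - kube-scheduler
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --feature-gates=GangScheduling=false        # revert to pod-by-pod scheduling
```

The kubelet restarts the static Pod when the manifest changes; as noted in the rollout answers above, already-running workloads are unaffected and only new scheduling decisions revert to pod-by-pod behavior.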
|
|
||
| ## Implementation History | ||
|
|
||
| <!-- | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,6 +8,7 @@ authors: | |
| - "@44past4" | ||
| - "@andreyvelich" | ||
| - "@thockin" | ||
| - "@mm4tt" | ||
|
|
||
| owning-sig: sig-scheduling | ||
| participating-sigs: | ||
|
|
@@ -27,12 +28,12 @@ replaces: | |
| # The target maturity stage in the current dev cycle for this KEP. | ||
| # If the purpose of this KEP is to deprecate a user-visible feature | ||
| # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. | ||
| stage: alpha | ||
| stage: beta | ||
|
|
||
| # The most recent milestone for which work toward delivery of this KEP has been | ||
| # done. This can be the current (upcoming) milestone, if it is being actively | ||
| # worked on. | ||
| latest-milestone: "v1.35" | ||
| latest-milestone: "v1.36" | ||
|
|
||
| # The milestone at which this feature was, or is targeted to be, at each stage. | ||
| milestone: | ||
|
|
@@ -54,3 +55,6 @@ disable-supported: true | |
|
|
||
| # The following PRR answers are required at beta release | ||
| metrics: | ||
| - scheduler_pod_group_scheduling_attempts_total | ||
| - scheduler_pod_group_scheduling_duration_seconds | ||
| - scheduler_pod_group_scheduling_algorithm_duration_seconds | ||
|
Member
This metric should not bring any new information over the metric above.
Author
Are we saying we'll not be adding it?
Author
IIUC, they serve different purposes, similar to the standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (such as snapshotting or queue locking), doesn't it?

Can't comment on the lines that didn't change, so adding some comments here:
- As part of Beta, we will be adding two new feature gates (that may go directly to beta); we should reflect that in the answer.
- Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too.
- This question needs adjustments to reflect the newly introduced gates - see above.
- Has this been implemented? If not, please ensure that we will not promote to beta without this test. [Or maybe we can even implement it in the meantime?]