2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-scheduling/4671.yaml
@@ -1,3 +1,5 @@
kep-number: 4671
alpha:
  approver: "@soltysh"
beta:
  approver: "@soltysh"
Member comment:

Can't comment on the lines that didn't change, so adding some comments here:

> How can this feature be enabled / disabled in a live cluster?

As part of Beta, we will be adding two new feature gates (that may go directly to beta):

1. `WorkloadSchedulingCycle` - to gate the logic related to it
2. `DelayedPreemption` - to gate the delayed preemption logic

We should reflect that in the answer.

> Does enabling the feature change any default behavior?

Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?

> Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

This question needs adjustments to reflect the newly introduced gates - see above

> Are there any tests for feature enablement/disablement?

Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
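
To make the answer concrete, below is a minimal sketch of how an operator might enable the proposed gates on the scheduler. The gate names are taken from this comment and are not final, and the static-pod layout is illustrative only.

```yaml
# Illustrative only: a static-pod kube-scheduler manifest with the proposed
# gates enabled. Gate names come from the review discussion above and may
# change before beta.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.36.0
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
```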

172 changes: 104 additions & 68 deletions keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -766,30 +766,52 @@ This section must be completed when targeting beta to a release.

###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

Workloads that do not use the Workload API should not be impacted, since the functionality remains unchanged for them.
During a rolling upgrade, if the active scheduler instance has the feature disabled, it will schedule pods using the
standard pod-by-pod method. This results in a fallback to the status quo behavior, meaning that pods will still be
scheduled, but PodGroup-level scheduling constraints won't be applied.

The worst-case scenario is a critical bug in the new Workload Scheduling Cycle code causing a scheduler crash loop.
This would stop all scheduling but would not impact already running workloads, and rollback is a sufficient
mitigation.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
- `scheduler_schedule_attempts_total{result="error"}`: A sudden spike indicates internal errors or panics within
the scheduling loop, possibly caused by the new logic.
- `process_start_time_seconds`: Frequent resets of this metric indicate that the scheduler process is crashing and
restarting (crash loop).
- `scheduler_pod_scheduling_duration_seconds`: A significant regression in P99 latency for standard (non-gang) pods
would indicate that the overhead of the new logic is unacceptable.
- `scheduler_pod_group_scheduling_attempts_total` (new metric, TODO: check with Maciek on the metric name): Consistently
high failure rates for valid gangs compared to successful attempts.
- `scheduler_preemption_attempts_total`, `scheduler_preemption_victims`: A sudden increase might indicate that the
new "delayed preemption" logic is malfunctioning (e.g., triggering unnecessary preemptions).

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
We'll perform manual testing of the upgrade -> downgrade -> upgrade path using the following sequence:

1. Start a local Kubernetes v1.35 cluster with GenericWorkload and GangScheduling feature gates disabled (default
behavior).
2. Attempt to create a Pod with `spec.workloadRef` set.
3. The `spec.workloadRef` field is dropped by the API server. The pod is created successfully but without the workload
reference, resulting in immediate standard scheduling (one-by-one).
4. Restart/Upgrade API Server and Scheduler to v1.36 (with feature gates enabled).
5. Create two Workload objects: `gang-test-A` and `gang-test-B` (both with `minCount=2`); example manifests are
   sketched after this list.
6. Create a Pod `test-pod-1` with `spec.workloadRef` pointing to `gang-test-A`.
7. The Pod stays in `Pending` state (waiting for the gang). We verified that the
   `scheduler_pod_group_scheduling_attempts_total` metric is incremented.
Member comment:


The metric should not be incremented as pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.

Author comment:


Ok, is there any other metric / event that we can use here?

Member comment:


Can we just check "pending_pods" instead?

The pods are blocked, so will they be reported in the "gated" queue? Or is "gated" only for the ones with scheduling gates? @macsko

Member comment:


> The pods are blocked, so will they be reported in the "gated" queue?

Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.

8. Create a Pod `test-pod-2` pointing to the same workload.
9. Both pods are scheduled successfully in the same cycle (Gang Scheduling works).
10. Downgrade API Server and Scheduler to v1.35 (with feature gates disabled).
11. Create `test-pod-3` pointing to `gang-test-B`. Note: We use a workload created in step 5 because creating new
Workload objects is disabled.
12. The pod is scheduled immediately (Workload logic is ignored because the workloadRef field is dropped by
    the v1.35 API server). If Gang Scheduling were active, this pod would remain pending, waiting for a second member.
13. Upgrade API Server and Scheduler back to v1.36 (feature gates enabled).
14. Create `test-pod-4` and `test-pod-5` pointing to `gang-test-B`, verifying that Gang Scheduling functionality is
    restored (these pods wait for `minCount=2` before scheduling).
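
The objects referenced in steps 5-6 might look roughly like the sketch below (names lower-cased to satisfy Kubernetes naming rules). The Workload API group, kind, and field layout shown here are illustrative; the authoritative shapes are the ones defined in this KEP's API section.

```yaml
# Illustrative shapes only - the exact Workload schema and the location of
# workloadRef on the Pod are defined by the KEP and may differ.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: gang-test-a
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test-a   # the Workload defined above
    podGroup: workers
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
```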

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -806,11 +828,11 @@ previous answers based on experience in the field.

###### How can an operator determine if the feature is in use by workloads?

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
Operators can check the new `scheduler_pod_group_scheduling_attempts_total` metric. A value greater than zero
indicates that the scheduler is processing Workload Scheduling Cycles.

Alternatively, checking for the existence of `Workload` objects via `kubectl get workloads`, or checking whether the
`pod.spec.workloadRef` field is set on Pods, confirms that users are actively using the feature.

###### How can someone using this feature know that it is working for their instance?

@@ -823,50 +845,38 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

- [x] API .spec
  - Other field: workloadRef is set on the Pods.
- [x] Events
Member comment:


@macsko - are we going to have this event?

Anyway, I think this is misleading, because the lack of these events doesn't mean the feature isn't working, so I would remove it.

  - Event Type: Warning
  - Event Reason: FailedScheduling
  - Event Message: The message includes details if the scheduling failed due to gang constraints (e.g., "pod group
    minCount requirement not met").

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.

It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code

These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
Since there are no formal SLOs for the kube-scheduler apart from scalability SLOs, we define the objectives for this
feature primarily in terms of non-regression, ensuring that workload scheduling does not degrade the performance of
the standard scheduling loop.

- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s)
when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods.
Member comment:


How would you measure it using existing metrics?

[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]


- Scheduling Latency: There should be no significant regression in pod scheduling latency
(`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline.
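
One way to measure the throughput objective with existing metrics, following the suggestion in the comment above, could be to count pod `binding` subresource requests on the API server. The recording rule below is a sketch, not a committed SLI definition.

```yaml
# Illustrative recording rule approximating scheduling throughput (pods/s)
# via API server binding calls.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-throughput-sli
spec:
  groups:
  - name: scheduler-sli
    rules:
    - record: cluster:scheduler_binding_rate:sum
      expr: sum(rate(apiserver_request_total{resource="pods",subresource="binding"}[5m]))
```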

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
  - `scheduler_pod_group_scheduling_attempts_total`
  - `scheduler_pod_group_scheduling_duration_seconds`
  - `scheduler_pod_group_scheduling_algorithm_duration_seconds`
Comment on lines +872 to +874 (Member):


Metric names look good

Member comment:


@soltysh - FYI regarding #5558 (comment)

These are matching what we were talking about 4 months ago.

  - Components exposing the metric: kube-scheduler

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
No.

### Dependencies

@@ -945,23 +955,49 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

The behavior is consistent with the status quo. Since the scheduler cannot bind pods or update statuses without the
API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will be
retried with standard exponential backoff once connectivity is restored.

###### What are other known failure modes?

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
- Pods Pending Indefinitely - Waiting for Gang Assembly (PreEnqueue)
- Detection:
- Check Pod Events/Status. Expected reason: a message indicating that the pod is waiting for more gang members.
- The number of pending pods belonging to the group is less than minCount.
- Mitigations:
- Ensure the controller created all required pods.
    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
- Diagnostics:
- Scheduler logs at V=4 searching for "workload" to trace the decision flow.
- Verify minCount in the Workload matches the number of pods created by the Job/Controller.
- Testing:
- Covered by integration tests submitting partial gangs.
- Pods Pending Indefinitely - Gang cannot fit (Resource Constraints)
- Detection: Check Pod Events/Status. Expected reason: a message indicating that minCount pods could not be
scheduled.
- Metrics: `scheduler_pod_group_scheduling_attempts_total` with result unschedulable.
- Mitigations:
    - Scale up the cluster (add nodes) or delete other workloads to free up capacity.
    - If acceptable, delete the Workload object to disable gang scheduling (falling back to best-effort scheduling).
- Diagnostics:
- Scheduler logs at V=4 searching for "workload" to see detailed reasons why the placement failed.
- Testing:
- Covered by integration tests submitting gangs larger than cluster capacity.
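
For the detection steps above, an operator could start from queries along the following lines. This is a sketch: the `queue="gated"` label value follows the review discussion earlier in this PR, and the pod-group metric name and its `result` label are still subject to change.

```yaml
# Illustrative queries for the two failure modes, wrapped in a PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gang-scheduling-failure-modes
spec:
  groups:
  - name: gang-scheduling
    rules:
    # Pods waiting for gang assembly are blocked at PreEnqueue ("gated" queue).
    - alert: GangPodsGatedTooLong
      expr: scheduler_pending_pods{queue="gated"} > 0
      for: 30m
    # Gangs repeatedly failing to fit show up as unschedulable attempts.
    - alert: GangUnschedulable
      expr: rate(scheduler_pod_group_scheduling_attempts_total{result="unschedulable"}[15m]) > 0
      for: 30m
```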

###### What steps should be taken if SLOs are not being met to determine the problem?

1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and
   `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling
   Cycle logic itself is computationally expensive and causing the regression.

Member comment:

Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.

Author comment:

Answered in the other comment, PTAL.

2. Inspect Logs: Enable scheduler logging at V=4 to trace the execution time of individual Workload Scheduling
Cycles and identify if specific large gangs are blocking the queue.
3. Disable Feature: If the regression is critical and impacting cluster health, disable the GangScheduling feature
gate. This will revert the scheduler to the standard pod-by-pod logic, restoring baseline performance (at the
cost of losing gang semantics).

## Implementation History

<!--
8 changes: 6 additions & 2 deletions keps/sig-scheduling/4671-gang-scheduling/kep.yaml
@@ -8,6 +8,7 @@ authors:
- "@44past4"
- "@andreyvelich"
- "@thockin"
- "@mm4tt"

owning-sig: sig-scheduling
participating-sigs:
@@ -27,12 +28,12 @@ replaces:
# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.35"
latest-milestone: "v1.36"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
@@ -54,3 +55,6 @@ disable-supported: true

# The following PRR answers are required at beta release
metrics:
- scheduler_pod_group_scheduling_attempts_total
- scheduler_pod_group_scheduling_duration_seconds
- scheduler_pod_group_scheduling_algorithm_duration_seconds
Member comment:


This metric should not bring any new information over the metric above.

Author comment:


Are we saying we'll not be adding it?

Author comment:


IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?