Member

@helayoty helayoty commented Feb 2, 2026

/sig scheduling

Signed-off-by: helayoty <heelayot@microsoft.com>
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot requested a review from kow3ns February 2, 2026 16:53
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 2, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Feb 2, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 2, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 2, 2026
@helayoty
Member Author

helayoty commented Feb 2, 2026

cc @mm4tt @erictune @soltysh @kannon92

components:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
Member

I don't think there are any changes required for kube-scheduler

@@ -0,0 +1,3 @@
kep-number: 5547
alpha:
approver: "@soltysh"
Contributor

@soltysh is listed as approver for sig-apps also.

We should probably look for someone else to approve PRR.

Contributor

Please post on #prod-readiness to see who can take it.

### Goals

- The Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Jobs with `parallelism > 1` use `GangSchedulingPolicy` with `minCount = parallelism`.
Contributor

This would break if JobSet also adds gang support.

How can someone opt out of this even if parallelism > 1?
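
To make the quoted rule concrete, here is a minimal Go sketch of the selection logic as the KEP text states it. The `Policy` type and `selectPolicy` function are hypothetical names for illustration and are not the real Workload or PodGroup API. Note that the rule takes only `parallelism` as input, which is why there is no way to opt out in this form.

```go
package main

import "fmt"

// Policy is a hypothetical, simplified stand-in for the scheduling policy
// the Job controller would set. It is NOT the real API type.
type Policy struct {
	Gang     bool  // true for gang scheduling, false for basic (pod-by-pod)
	MinCount int32 // only meaningful when Gang is true
}

// selectPolicy applies the rule quoted from the KEP: parallelism > 1 selects
// gang scheduling with minCount equal to parallelism; anything else is basic.
// A nil parallelism is treated as 1, which is the Job API default.
func selectPolicy(parallelism *int32) Policy {
	p := int32(1)
	if parallelism != nil {
		p = *parallelism
	}
	if p > 1 {
		return Policy{Gang: true, MinCount: p}
	}
	return Policy{}
}

func main() {
	four := int32(4)
	fmt.Printf("parallelism=4   -> %+v\n", selectPolicy(&four))
	fmt.Printf("parallelism=nil -> %+v\n", selectPolicy(nil))
}
```

An opt-out would need an extra input to this function (for example a field or annotation on the Job); the quoted text does not define one, so none is shown here.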

@helayoty
Member Author

helayoty commented Feb 2, 2026

/sig apps

We will add the following integration tests to the Job controller `https://github.com/kubernetes/kubernetes/blob/v1.35.0/test/integration/job/job_test.go`:
- Gang and Basic Scheduling Lifecycle Test (create, update, delete Job, verify Workload and PodGroup creation, verify pods have workloadRef, verify Job deletion cascades to Workload and PodGroup deletion)
- Failure Recovery Test (create Job with Workload API unavailable, verify Job controller retries, verify Workload is eventually created)
- Feature gate disable/enable (Jobs work without Workload/PodGroup creation; Jobs with ownerReferences managed by higher-level controllers do not create Workload/PodGroup)
Contributor

I see a few areas we need to cover in alpha:

  • How does this feature work with suspended jobs?
  • If a job has `ownerReferences` set, can we verify that no workload is created?
  • ElasticJob is forbidden. We should test/verify this.
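
The feature-gate and `ownerReferences` cases from the test plan can be sketched as a single predicate that a table-driven test would exercise. This is a sketch under assumptions: `Job`, `OwnerKinds`, and `shouldCreateWorkload` are hypothetical, simplified names, not the Job controller's actual code. Suspended Jobs are left out because their behavior is still an open question in this thread.

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down model of a batch Job.
type Job struct {
	Name       string
	OwnerKinds []string // kinds listed in metadata.ownerReferences
}

// shouldCreateWorkload reports whether the controller would create Workload
// and PodGroup objects for the Job, per the two cases in the test plan:
// nothing is created when the feature gate is off, and nothing is created
// when the Job is owned by a higher-level controller.
func shouldCreateWorkload(featureEnabled bool, job Job) bool {
	if !featureEnabled {
		return false
	}
	return len(job.OwnerKinds) == 0
}

func main() {
	cases := []struct {
		enabled bool
		job     Job
	}{
		{true, Job{Name: "standalone"}},
		{true, Job{Name: "owned", OwnerKinds: []string{"JobSet"}}},
		{false, Job{Name: "gate-off"}},
	}
	for _, c := range cases {
		fmt.Printf("%s: create=%v\n", c.job.Name, shouldCreateWorkload(c.enabled, c.job))
	}
}
```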

- The automatic policy selection is based on the `Job` type.
- Jobs with `parallelism > 1` use gang scheduling policy where `minCount` equals the Job's parallelism.
- Jobs without indexed completion mode, or with `completions = 1`, use basic scheduling policy (pod-by-pod scheduling - `minCount`).
- Elastic Jobs (changing parallelism at runtime) are not supported when gang scheduling is active.
Contributor

This effectively breaks Elastic Indexed Jobs, which is a GA feature, when this feature turns on.

Contributor

I would prefer to forbid modifying parallelism in this case but do it in a way that doesn't break existing users of this feature.
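
One possible reading of this suggestion, sketched in Go: reject a parallelism change only when the Job is actually gang scheduled, so Jobs outside the feature keep their current elastic behavior. The `JobSpec` model, its `GangScheduled` flag, and `validateParallelismUpdate` are assumptions made for illustration; the KEP does not specify how the check would be implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// JobSpec is a hypothetical, simplified model. GangScheduled stands in for
// whatever signal marks a Job as gang scheduled.
type JobSpec struct {
	Parallelism   int32
	GangScheduled bool
}

var errParallelismImmutable = errors.New("parallelism cannot change while gang scheduling is active")

// validateParallelismUpdate rejects a parallelism change only for Jobs that
// are gang scheduled; all other Jobs may still scale freely.
func validateParallelismUpdate(oldSpec, newSpec JobSpec) error {
	if oldSpec.GangScheduled && oldSpec.Parallelism != newSpec.Parallelism {
		return errParallelismImmutable
	}
	return nil
}

func main() {
	gang := JobSpec{Parallelism: 4, GangScheduled: true}
	plain := JobSpec{Parallelism: 4}

	fmt.Println("gang job, 4 -> 8:", validateParallelismUpdate(gang, JobSpec{Parallelism: 8, GangScheduled: true}))
	fmt.Println("plain job, 4 -> 8:", validateParallelismUpdate(plain, JobSpec{Parallelism: 8}))
}
```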
