
Conversation

@rst0git (Member) commented Jan 29, 2026

  • One-line PR description: Enable support for Pod-level checkpoint and restore.

@k8s-ci-robot (Contributor):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 29, 2026
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rst0git
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 29, 2026
allows to transparently capture the state of these workloads and to resume execution
from the most recent snapshot in the case of failures.

[CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads](https://arxiv.org/abs/2502.16631)

@rst0git can we somehow make the checkpointing and restore tool more generic? I can see the word CRIU a few times in the document, and as other members asked in the last meeting, we should consider a proposal that would also enable gVisor, Kata Containers, or any other QEMU/KVM-based runtime to benefit from this implementation in Kubernetes, as different tools might not use CRIU for C/R.

For GPU workloads, other runtimes such as gVisor also use cuda-checkpoint, but with no CRIU involved, as they implement everything in runsc instead.


### Non-Goals

* The initial Pod-level checkpoint and restore implementation is limited to a subset of Pod resources. The coverage of additional Pod and cluster resources is out of scope for this proposal.

Can you enumerate what is out of scope here?

* Behavior:
* Recreates Pod and namespaces
* Restores container runtime state
* Reconciles networking and storage

This may be trickier than this bullet implies; the pod may need to be stateless and not hold a pod IP to start.

* Checkpoint identifier
* Operation status

#### RestorePod

Is restore in scope for this KEP?

* Scheduling constraints and security contexts

#### Container Runtime State
* CRIU-generated checkpoint images for each container

@fals what does this look like for other non CRIU checkpointing options?


gVisor does not dump in the same format as CRIU, and QEMU/KVM-based runtimes are even less aligned, as they just snapshot the whole microVM, which has all pods and containers inside.


I don't think we need to specifically say what the runtime does here. I would like to find a way to phrase this KEP so that it is implementation agnostic, while still maintaining shared pieces (if we pass an option down, we need to find a way that option can be passed to the various implementations, if relevant).

@fals fals commented Jan 30, 2026:

I would say we should phrase it as follows:

A checkpoint can be generated by different tools depending on the runtime used by the Pod. The same tool must be used to restore, as there is no compatibility between them. The high-level CRD holding information about the checkpointed Pod MUST contain details about the tool used during checkpointing, and this will be passed downstream from the API server to the CRI during restoration. The solution must be tool-agnostic, and any additional parameters needed by specific tools can be passed during C/R using command-line arguments from the API.
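The suggestion above might translate into a CRD shape like the following. Everything here is an illustrative assumption, not the KEP's actual schema: the API group, the `checkpointer` block with its `tool`, `version`, and `args` fields, and the `location` path are all hypothetical.

```yaml
# Illustrative-only sketch of a PodCheckpoint object that records which
# tool produced the checkpoint, so the same tool can be selected on restore.
apiVersion: checkpoint.k8s.io/v1alpha1   # hypothetical API group
kind: PodCheckpoint
metadata:
  name: web-frontend-ckpt-001
spec:
  podRef:
    name: web-frontend
    namespace: default
status:
  phase: Succeeded
  checkpointer:
    tool: criu                 # e.g. criu, gvisor, qemu; assumed values
    version: "4.0"
    args: ["--tcp-established"]  # tool-specific parameters passed through C/R
  location: /var/lib/kubelet/checkpoints/web-frontend-ckpt-001.tar
```

Recording the tool and its parameters in `status` rather than `spec` reflects that they are produced by the runtime during checkpointing, not requested by the user; either placement would satisfy the comment's intent.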

* Container-specific security contexts and capabilities

#### Shared Pod Resources
* Pod network namespace state, including Pod IP address where feasible

We'll need to loop in SIG Network here for Pod IP saving; I think we should do that as a follow-on.


Initial support focuses on metadata and container runtime state. Additional resources such as
shared memory, EmptyDir volumes, and other volume types will be added in future iterations.

I'd even go so far as to say a checkpointed pod should be stateless for the first iteration, and then we address stateful pods in a follow-on kep

* Restoring

#### Allowed Transitions
* Running → Checkpointing → Running

"Running" is not a pod state, though; there are a lot of pod states and the state machine is pretty complex. Can we checkpoint if a pod is being resized in place? How about if the init containers are still running? I think we need to muscle through this a bit more.

- "@haircommander"
owning-sig: sig-node
participating-sigs:
- sig-node


Aren't these also participating SIGs?

  • sig-storage
  • sig-api-machinery

@wendy-ha18 (Member):

Hi @rst0git, I'm Wendy from the SIG Node KEP Wranglers, and this KEP is at risk of missing the PRR deadline. Are you still aiming to land this KEP in v1.36? The PRR deadline is approaching tomorrow (Wednesday 4 February 2026 (AoE) / Thursday 5 February 2026, 12:00 UTC).

This is the checklist we need to meet to pass PRR. Do you think we can land these requirements before the deadline, or do you need more time (considering an exception request in advance)?

* PR open or merged with the KEP's PRR questionnaire filled out. - PENDING
* PR open or merged with kep.yaml updated with the stage, latest-milestone, and milestone struct filled out. - PENDING
* PR open or merged with a PRR approval file with the PRR approver listed for the stage the KEP is targeting. - PENDING

@rst0git rst0git force-pushed the pod-checkpoint-restore branch 2 times, most recently from 25d89d1 to 155019a Compare February 3, 2026 14:40
@rst0git rst0git marked this pull request as ready for review February 3, 2026 14:43
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2026
@rst0git (Member, Author) commented Feb 3, 2026

@wendy-ha18 Thank you for your message! I've updated the pull request with the KEP's PRR questionnaire and kep.yaml filled out.

@rst0git rst0git force-pushed the pod-checkpoint-restore branch 2 times, most recently from 34973b6 to 194e916 Compare February 3, 2026 14:49
@wendy-ha18 (Member):

Thanks @rst0git. One thing left to ensure we can safely pass the PRR deadline: I'm not sure who the PRR reviewer (and PRR shadow) is for this KEP.

I have asked in #prod-readiness channel in Slack here. Please feel free to follow up further with it when you have time.

@rst0git rst0git force-pushed the pod-checkpoint-restore branch from 194e916 to c69e12b Compare February 4, 2026 04:05
Co-authored-by: Adrian Reber <areber@redhat.com>
Co-authored-by: Dan Feigin <dfeigin@nvidia.com>
Signed-off-by: Radostin Stoyanov <radostin.stoyanov@eng.ox.ac.uk>
@rst0git rst0git force-pushed the pod-checkpoint-restore branch from c69e12b to 92c6615 Compare February 4, 2026 05:04
6 participants