KEP-5823: Pod-Level Checkpoint/Restore #5851
base: master
Conversation
rst0git commented Jan 29, 2026
- One-line PR description: Enable support for Pod-level checkpoint and restore.
- Issue link: Pod-Level Checkpoint/Restore #5823
- Other comments: Related to Forensic Container Checkpointing #2008 and Checkpointing API #5091
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rst0git. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
| allows to transparently capture the state of these workloads and to resume execution |
| from the most recent snapshot in the case of failures. |
| [CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads](https://arxiv.org/abs/2502.16631) |
@rst0git can we somehow make the checkpoint and restore tooling more generic? The word CRIU appears several times in the document, and as other members asked in the last meeting, we should consider a proposal that would also enable gVisor, Kata Containers, or any other QEMU/KVM-based runtime to benefit from this implementation in Kubernetes, since different runtimes might not use CRIU for C/R.
For GPUs, other runtimes such as gVisor also use cuda-checkpoint, but with no CRIU involved, since they implement everything in runsc instead.
| ### Non-Goals |
| * The initial Pod-level checkpoint and restore implementation is limited to a subset of Pod resources. The coverage of additional Pod and cluster resources is out of scope for this proposal. |
can you enumerate what is out of scope here?
| * Behavior: |
|   * Recreates Pod and namespaces |
|   * Restores container runtime state |
|   * Reconciles networking and storage |
This may be trickier than this bullet implies; the restored pod may need to be stateless and not hold on to the pod IP to start with.
| * Checkpoint identifier |
| * Operation status |
| #### RestorePod |
Is restore in scope for this KEP?
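For concreteness, below is a rough sketch of what runtime-agnostic CheckpointPod/RestorePod CRI messages carrying a checkpoint identifier and operation status might look like. All type and field names here are illustrative assumptions for this discussion, not the API defined in the KEP.

```go
// Illustrative sketch only: these message types and field names are
// assumptions for discussion, not the CRI API proposed by the KEP.
package cri

// CheckpointPodRequest asks the runtime to checkpoint an entire pod sandbox.
type CheckpointPodRequest struct {
	PodSandboxID string // sandbox to checkpoint
	Location     string // where the runtime should write the checkpoint artifact
	TimeoutSec   int64  // abort the operation if it takes longer than this
}

// CheckpointPodResponse reports the result of a checkpoint operation.
type CheckpointPodResponse struct {
	CheckpointID string // checkpoint identifier, used later by RestorePod
	Status       string // operation status, e.g. "succeeded" or "failed"
}

// RestorePodRequest asks the runtime to recreate a pod from a checkpoint.
type RestorePodRequest struct {
	CheckpointID string // identifier returned by a previous CheckpointPodResponse
	Location     string // where the checkpoint artifact can be read from
}
```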
| * Scheduling constraints and security contexts |
| #### Container Runtime State |
| * CRIU-generated checkpoint images for each container |
@fals what does this look like for other non-CRIU checkpointing options?
gVisor does not dump in the same format as CRIU, and QEMU/KVM-based runtimes are even less aligned, as they just snapshot the whole microVM, which has all pods and containers inside.
I don't think we need to specifically say what the runtime does here. I would like to find a way to phrase this KEP so that it is implementation-agnostic while still maintaining shared pieces (if we pass an option down, we need to find a way that option can be passed to the various implementations, if relevant).
I would say we should phrase it as: the checkpoint can be generated by different tools depending on the runtime used by the pod. The same tool should be used to restore, as there is no backward compatibility between them. The high-level CRD holding information about the checkpointed pod MUST contain details about the tool used during checkpointing, and it will be passed downstream from the API server to the CRI during restoration. The solution must be tool-agnostic, and any additional parameters needed by specific tools can be passed during C/R using command-line arguments from the API.
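To illustrate that suggestion, a hypothetical CRD type recording which tool produced the checkpoint and its extra arguments could look roughly like this. The group, kind, and field names are assumptions made up for this discussion, not part of the KEP.

```go
// Hypothetical API type sketch for the discussion above; the kind and
// field names are assumptions, not the KEP's actual API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodCheckpoint records where a pod checkpoint is stored and which tool
// produced it, so the same tool can be selected again on restore.
type PodCheckpoint struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodCheckpointSpec   `json:"spec"`
	Status PodCheckpointStatus `json:"status,omitempty"`
}

type PodCheckpointSpec struct {
	// PodName is the pod that was (or will be) checkpointed.
	PodName string `json:"podName"`
	// Tool identifies the checkpointing tool, e.g. "criu", "gvisor", "qemu".
	Tool string `json:"tool"`
	// ToolArgs are additional tool-specific command-line arguments,
	// forwarded from the API server to the CRI runtime during C/R.
	ToolArgs []string `json:"toolArgs,omitempty"`
	// Location is where the checkpoint artifact is stored.
	Location string `json:"location,omitempty"`
}

type PodCheckpointStatus struct {
	// Phase reports progress, e.g. "Checkpointing", "Ready", "Failed".
	Phase string `json:"phase,omitempty"`
}
```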
| * Container-specific security contexts and capabilities |
| #### Shared Pod Resources |
| * Pod network namespace state, including Pod IP address where feasible |
We'll need to loop in SIG Network here for pod IP saving; I think we should do that as a follow-on.
| * Pod network namespace state, including Pod IP address where feasible |
| Initial support focuses on metadata and container runtime state. Additional resources such as |
| shared memory, EmptyDir volumes, and other volume types will be added in future iterations. |
I'd even go so far as to say a checkpointed pod should be stateless for the first iteration, and then we address stateful pods in a follow-on KEP.
| * Restoring |
| #### Allowed Transitions |
| * Running → Checkpointing → Running |
"running" is not a pod state though, there's a lot of pod states and the state machine is pretty complex. can we checkpoint if a pod is being resized in place? how about if the init containers are still running? I think we need to muscle through this a bit more
| - "@haircommander" | ||
| owning-sig: sig-node | ||
| participating-sigs: | ||
| - sig-node |
Aren't these also participating SIGs?
- sig-storage
- sig-api-machinery
Hi @rst0git, I'm Wendy from SIG Node KEP Wrangler, and this KEP is at risk for the PRR deadline at the moment. Are you still aiming for this KEP in v1.36? The PRR deadline is approaching tomorrow (Wednesday 4th February 2026 (AoE) / Thursday 5th February 2026, 12:00 UTC). This is the checklist we need to meet to be able to pass PRR. Do you think we can land these requirements before the deadline, or do you need more time (considering an exception request in advance)?
25d89d1 to 155019a
@wendy-ha18 Thank you for your message! I've updated the pull request with the KEP's PRR questionnaire and
34973b6 to 194e916
Thanks @rst0git, one thing left to ensure we can safely pass the PRR deadline: I'm not really sure who the PRR reviewer (and PRR shadow) is for this KEP. I have asked in the #prod-readiness channel in Slack here. Please feel free to follow up further on it when you have time.
194e916 to c69e12b
Co-authored-by: Adrian Reber <areber@redhat.com>
Co-authored-by: Dan Feigin <dfeigin@nvidia.com>
Signed-off-by: Radostin Stoyanov <radostin.stoyanov@eng.ox.ac.uk>
c69e12b to 92c6615