
Details

    • Feature
    • Resolution: Unresolved
    • Normal

    Description

      https://docs.google.com/document/d/1vTBX3U0ZtE0VGxupqQhyYFxRtETdFLfpX7XHEqaMi7Y/edit 

      What it is 

       

      Checkpoint & Restore allows you to freeze a running container by checkpointing it, which turns its state into a collection of files on disk. Later, the container can be restored from the point it was frozen.
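      As a rough illustration of the checkpoint-then-restore cycle described above, the sketch below builds the Podman commands that export a checkpoint to an archive and restore from it. The container name `web` and the archive path are hypothetical, and the commands are only printed unless `dry_run=False` is passed (actually running them typically requires root and an installed CRIU).

```python
import shlex
import subprocess
from typing import List


def checkpoint_cmd(container: str, archive: str) -> List[str]:
    """Build the Podman command that checkpoints a container into an archive."""
    return ["podman", "container", "checkpoint", container, f"--export={archive}"]


def restore_cmd(archive: str) -> List[str]:
    """Build the Podman command that restores a container from an archive."""
    return ["podman", "container", "restore", f"--import={archive}"]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Return the shell form of cmd; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    # Hypothetical container name and archive location, for illustration only.
    archive = "/tmp/web-checkpoint.tar.gz"
    print(run(checkpoint_cmd("web", archive)))
    print(run(restore_cmd(archive)))
```

      Keeping the default as a dry run makes the sketch safe to execute on a machine without Podman; the archive file is the "collection of files on disk" that the description refers to.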

       

      Use cases

       

      • Speeding up the start time of slow-starting applications
      • "Rewinding" processes to an earlier point in time
      • "Forensic debugging" of running processes

       

      Who can initiate a checkpoint/restore: the cluster/OpenShift admin or a developer?

       

      The current idea is that only the cluster administrator can do it. The first feature I am trying to get into Kubernetes is not yet about migration; it covers only checkpointing and restoring of pods/containers. If we reach the point where we can migrate containers from one node to another, the trigger might be the cluster administrator or a policy decision.

       

      I know from Google’s internal migration support on Borg that developers can mark their application as not migratable. So a developer can declare that the application should never be migrated, but a migration itself is always triggered by an administrator or by a policy.

       

      What types of application will benefit from checkpoint/restore?

       

      • Long-running jobs that have to start again from the beginning if they are stopped
      • Applications that hold their state in memory
      • Applications that have a long initialization time

      [https://docs.google.com/document/d/1TP2Av7zHXz4_fmeX4q9HB0m9cqSZ4F6Jd4AiVoaF_2s/edit#heading=h.gaa58bzbvwde]

      Can we benchmark the use case? For example, does it improve the start time of a Java application by x%, or can we otherwise quantify the benefits?

       

      I (Adrian) have only measured it with extremely simple WildFly applications: instead of a startup time of 8 seconds, the restore takes 4 seconds. That is 50% faster, but not really a useful benchmark.
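      The 50% figure follows directly from the two timings quoted above; a small sketch of the arithmetic, with function names chosen here purely for illustration:

```python
def startup_reduction_percent(baseline_s: float, restore_s: float) -> float:
    """Percentage of startup time saved by restoring instead of cold starting."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - restore_s) / baseline_s * 100.0


def speedup_factor(baseline_s: float, restore_s: float) -> float:
    """How many times faster the restore is compared with a cold start."""
    if restore_s <= 0:
        raise ValueError("restore_s must be positive")
    return baseline_s / restore_s


if __name__ == "__main__":
    # Timings from the WildFly measurement: 8 s cold start, 4 s restore.
    print(startup_reduction_percent(8.0, 4.0))
    print(speedup_factor(8.0, 4.0))
```

      The same helpers could be applied to other workloads once both a cold-start and a restore timing are available.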

       

      I am currently talking with MathWorks about a talk for KubeCon on checkpoint/restore/migration. MathWorks uses CRIU heavily in production to decrease the startup time of their MATLAB cloud offering. They can reduce the startup time of MATLAB by 5 minutes using Docker’s checkpoint/restore implementation. They are interested in moving to Kubernetes and contacted me when they saw my KEP and pull requests.

      What is the upstream status of the project?

       

      Checkpointing and restoring (all use cases from above) are supported in RHEL 8 with Podman and CRIU. The base technology (checkpoint/restore) has been part of RHEL since RHEL 7.

      Is there an OpenShift customer asking for it?

      Do we have to create an operator in OpenShift to manage the lifecycle of checkpoints created by containers?

      What are our competitors (GKE/EKS/AKS) doing to solve this problem?

       

      Google has used CRIU internally in their container runtime (Borg) for at least two years to live-migrate long-running containers when a node does not have enough resources. There are a couple of talks about how they do it. They mentioned that they used to throw away containers that had been running for a couple of hours if resources had to be freed; now they just migrate the container to another node. They tried to migrate interactive containers (Gmail, for example), but because the downtime during migration is always greater than 1 second, they do not use migration for interactive users.

       

      LXC/LXD has CRIU support, but it is not very well maintained. OpenVZ uses CRIU a lot (they invented it).

       

      I (Adrian) am not aware of any other CRIU integration in the orchestration layer.

      How can we track this work?

       

      Gaurav will create a feature in Jira under OCPPLAN.

       

      Where is the checkpoint saved: in memory or in persistent storage?

       

      Persistent storage. Any directory can be used; network-mounted storage would be preferred for the migration case. We are currently discussing using registries to store checkpoints. The idea is to change the checkpoint image format so that it is an OCI image.

       

      Can we restore the Pod using checkpoint/restore in another namespace (not the namespace where it was created)?

       

      Yes. The PID namespace is created by the pause container, the network namespace is created by CNI, and containers can be restored into any new Pod. It is currently possible to checkpoint a container in Podman and restore it in CRI-O. runc/CRIU will restore the container into any existing namespaces, depending on the values in config.json.
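      To make the role of config.json more concrete, here is a minimal sketch that reads the `linux.namespaces` list of an OCI runtime configuration and reports which namespaces point at an existing namespace (a `path` is set) and which would be newly created (no `path`). The example configuration, including its paths, is invented for illustration and does not come from a real Pod.

```python
import json
from typing import Dict, Optional

# Hypothetical excerpt of an OCI runtime config.json; the paths are made up.
EXAMPLE_CONFIG = """
{
  "linux": {
    "namespaces": [
      {"type": "pid", "path": "/proc/1234/ns/pid"},
      {"type": "network", "path": "/var/run/netns/example"},
      {"type": "mount"}
    ]
  }
}
"""


def namespace_plan(config_text: str) -> Dict[str, Optional[str]]:
    """Map each namespace type to the path it joins, or None if it is new."""
    config = json.loads(config_text)
    namespaces = config.get("linux", {}).get("namespaces", [])
    return {ns["type"]: ns.get("path") for ns in namespaces}


if __name__ == "__main__":
    for ns_type, path in namespace_plan(EXAMPLE_CONFIG).items():
        if path is None:
            print(f"{ns_type}: new namespace")
        else:
            print(f"{ns_type}: join existing namespace at {path}")
```

      In this reading, the PID and network entries would point at the namespaces set up by the pause container and CNI, which is why the restored container can land in a different Pod than the one it was checkpointed from.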


            People

              mpatel1@redhat.com Mrunal Patel
              gausingh@redhat.com Gaurav Singh
              Adrian Reber
              Mrunal Patel