OpenShift Bugs / OCPBUGS-36892

Node shutdown time varies with crun when containers start on boot vs are started later

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.14.z
    • Component/s: Containers
    • Severity: Low

      Node shutdown time varies with crun depending on whether containers are started on boot or later. With runc, this does not occur: the system always respects every container's terminationGracePeriodSeconds on shutdown.

      With crun:
      If I reboot the node, wait for it to come up fully including all pods, and then create the deployment, scale it, or simply delete the pod so that it is recreated:

      • shutdown, and more specifically the network target, is blocked for 300 seconds until the process is stopped with SIGKILL

      If I reboot the node and wait for all CRI-O containers, including that very same pod, to be brought up on boot, but do not instruct the API to start or stop any pods:

      • shutdown, and more specifically the network target, is not blocked; the journal and network targets are stopped after a few short seconds

      With runc:
      I always get:

      • shutdown, and more specifically the network target, is blocked for 300 seconds until the process is stopped with SIGKILL

      I'm running this test on a 4.14.23 bare-metal SNO node with crun. Note that I found this in a lab while testing something for a customer with a broken configuration. That specific customer uses runc rather than crun, but I still think this issue is worth investigating for crun: it is easy to reproduce and yields confusing results that diverge from the node's behavior with runc.

      Spawn the following deployment:

      cat <<'EOF' | oc apply -f -
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app: test
        name: test
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            containers:
            - command:
              - /bin/bash
              - "-c"
              - |
                trap -- '' SIGINT SIGTERM
                while true; do
                    date
                    sleep 1
                done
              image: registry.fedoraproject.org/fedora:latest
              imagePullPolicy: IfNotPresent
              name: fedora
            terminationGracePeriodSeconds: 300
      EOF
      

      The pod of this deployment has the following attributes:

      • its command cannot be stopped by SIGINT or SIGTERM, so SIGKILL is needed
      • the kubelet waits 300 seconds (terminationGracePeriodSeconds) for the process to exit gracefully before sending SIGKILL

      --> When the deployment's pod is deleted, it takes 300 seconds to stop.
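      The signal behavior can be verified locally outside Kubernetes. This is a minimal sketch: it runs the same trap-protected loop as the deployment's command, confirms that SIGTERM is ignored, and that only SIGKILL terminates the process.

      ```shell
      # Same signal-ignoring loop as in the pod spec, run as a background job
      bash -c 'trap -- "" SIGINT SIGTERM; while true; do sleep 1; done' &
      pid=$!
      sleep 1                              # give the trap time to install
      kill -TERM "$pid" 2>/dev/null        # SIGTERM is ignored by the trap
      sleep 1
      if kill -0 "$pid" 2>/dev/null; then survived_term=yes; else survived_term=no; fi
      kill -KILL "$pid" 2>/dev/null        # SIGKILL cannot be trapped
      wait "$pid" 2>/dev/null
      if kill -0 "$pid" 2>/dev/null; then killed=no; else killed=yes; fi
      echo "survived SIGTERM: $survived_term, dead after SIGKILL: $killed"
      ```

      This mirrors what the kubelet observes: the graceful stop request has no effect, so the pod only goes away once the grace period expires and SIGKILL is sent.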

      Test how long it takes to shut down the node:
      ==========

      Ping the node IP:

      ping <host IP>
      

      Shut down the node:

      ssh core@<host IP>
      reboot
      

      Results:

      i) SSH disconnects immediately; the network target and the journal shut down immediately; ping stops almost immediately;
         processes running in CRI-O containers are stopped **after** the journal has stopped (can be seen via IPMI)  (1_1.png)

      ii) SSH disconnects immediately; the shutdown of CRI-O takes a few minutes to complete before the network target and the journal shut down; ping keeps working for minutes
         a) shutdown takes less than 5 minutes, and some containers are still stopped _after_ the journal shut down
         b) shutdown takes the expected 5 minutes because the test-... pod's containers are blocking (2_b_1.png)
            even after this, we still see: Waiting for process  (2_b_2.png)
      

      The problem is that the node shutdown times look random at first. Sometimes the node takes north of 5 minutes to shut down, sometimes it is much faster. As stated above, further testing showed that this is tied to crun and to when the pod is started: if the pod is started during the system startup phase, the network is shut down almost immediately after running the `reboot` command. However, with crun, when I start a pod with a terminationGracePeriodSeconds some time after the node booted, the shutdown is delayed until the pod exits.
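      For illustration, the stop sequence a container runtime performs (SIGTERM, wait out the grace period, then SIGKILL) can be simulated locally. This is a sketch only, not CRI-O's actual implementation, and the grace period is shortened from 300 to 3 seconds so it finishes quickly:

      ```shell
      GRACE=3                                # illustrative; the pod spec uses 300
      bash -c 'trap -- "" SIGINT SIGTERM; while true; do sleep 1; done' &
      pid=$!
      start=$(date +%s)
      kill -TERM "$pid" 2>/dev/null          # step 1: graceful stop request, ignored here
      i=0
      while [ "$i" -lt "$GRACE" ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1; i=$((i+1))                  # step 2: wait out the grace period
      done
      kill -0 "$pid" 2>/dev/null && kill -KILL "$pid"   # step 3: grace expired, force kill
      wait "$pid" 2>/dev/null
      elapsed=$(( $(date +%s) - start ))
      echo "stopped after ${elapsed}s"
      ```

      With the real 300-second grace period, this sequence is exactly what blocks the node shutdown for 5 minutes when it is honored.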

      In the logs, look for the time between:

      Jul 11 17:36:41 sno10.workload.bos2.lab systemd-logind[2242]: System is rebooting.
      (...)
      

      and the actual end of the log. The provided data is from the exact same system on two consecutive boots; the only difference is whether my pod was started automatically on boot or via the API at some later point.
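      As a sketch, that window can be computed from a saved journal excerpt with a small helper. The function name `shutdown_window_seconds`, the sample log lines, and the year argument are all illustrative (journal short timestamps omit the year); GNU date is assumed:

      ```shell
      # Print the seconds between the "System is rebooting." line and the
      # last line of a saved journal excerpt.
      shutdown_window_seconds() {
        local log="$1" year="$2"
        local f l fs ls
        f=$(grep -m1 'System is rebooting' "$log")
        l=$(tail -n 1 "$log")
        # Rebuild "Mon DD YYYY HH:MM:SS" so GNU date can parse it
        fs=$(date -d "$(echo "$f" | awk -v y="$year" '{print $1, $2, y, $3}')" +%s)
        ls=$(date -d "$(echo "$l" | awk -v y="$year" '{print $1, $2, y, $3}')" +%s)
        echo $(( ls - fs ))
      }

      # Example with a hypothetical two-line excerpt:
      cat > /tmp/journal-excerpt.txt <<'EOF'
      Jul 11 17:36:41 sno10.workload.bos2.lab systemd-logind[2242]: System is rebooting.
      Jul 11 17:41:43 sno10.workload.bos2.lab systemd-journald[1123]: Journal stopped
      EOF
      shutdown_window_seconds /tmp/journal-excerpt.txt 2024   # prints 302
      ```

      A window of roughly 300 seconds indicates the grace period was honored; a window of a few seconds indicates the fast-shutdown case described above.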

              Assignee: Kirill Kolyshkin (kolyshkin)
              Reporter: Andreas Karis (akaris@redhat.com)
              QA Contact: Sunil Choudhary
              Votes: 0
              Watchers: 10