- Bug
- Resolution: Unresolved
- Normal
- None
- 4.14.z
- Low
- None
- 3
- False
Node shutdown time varies with crun depending on whether containers are started on boot or started later. With runc, this cannot be observed: the system always respects all containers' terminationGracePeriodSeconds on shutdown.
—
With crun:
If I reboot the node and wait for it to come up fully, including all pods, and then create the deployment, scale it, or simply delete the pod so that it's recreated after the reboot:
- shutdown and more specifically the network target are blocked for 300 seconds until the process is stopped with the KILL signal
If I reboot the node and wait for all crio containers to be brought up on boot, including that very same pod, but I do not instruct the API to start or stop pods:
- shutdown and more specifically the network target are not blocked; the journal and the network targets are stopped within a few seconds
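To make the two crun scenarios concrete, here is a minimal sketch of what I run in each case (assuming the test deployment from the reproducer further below is already applied in the default namespace):
# Scenario 1: (re)start the pod via the API some time after boot, then reboot
oc delete pod -n default -l app=test    # the deployment recreates the pod
# wait until the new pod is Running, then:
ssh core@<host IP> reboot               # -> shutdown blocks for ~300 seconds

# Scenario 2: let crio bring the pod up on boot and do not touch it via the API
ssh core@<host IP> reboot               # -> network and journal stop after a few seconds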
—
With runc:
I always get:
- shutdown and more specifically the network target are blocked for 300 seconds until the process is stopped with the KILL signal
—
I'm running this test on a 4.14.23 baremetal SNO node with crun. Note that I found this in a lab while testing something for a customer with an incorrect configuration. That specific customer is using runc, not crun, but I think this issue is still worth investigating for crun, as it is easy to reproduce and yields confusing results that diverge from the node's behavior with runc.
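For completeness: this report doesn't include how crun was enabled on the node. A typical way to switch the default runtime on OpenShift is a ContainerRuntimeConfig along these lines (a sketch only; the object name and the pool selector are assumptions, the exact object used in my lab may differ):
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  containerRuntimeConfig:
    defaultRuntime: crun
EOF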
—
Spawn the following deployment:
cat <<'EOF' | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - command:
        - /bin/bash
        - "-c"
        - |
          trap -- '' SIGINT SIGTERM
          while true; do
            date
            sleep 1
          done
        image: registry.fedoraproject.org/fedora:latest
        imagePullPolicy: IfNotPresent
        name: fedora
      terminationGracePeriodSeconds: 300
EOF
The pod of this deployment will have the following attributes:
- its command cannot be stopped by SIGINT or SIGTERM, hence SIGKILL is needed
- it is given 300 seconds (terminationGracePeriodSeconds) for a graceful shutdown of its process
--> When the deployment's pod is deleted, it will take 300 seconds to stop.
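A quick way to confirm the grace period before running the shutdown test (a sketch, assuming the deployment above in the default namespace):
# deleting the pod should take roughly 5 minutes because SIGTERM is ignored
time oc delete pod -n default -l app=test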
Test how long it takes to shut down the node:
==========
Ping the node IP:
ping <host IP>
Shutdown the node
ssh core@<host IP> reboot
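To put rough numbers on the shutdown, I timestamp the ping output and note when replies stop (a sketch; <host IP> as above):
# terminal 1: -D prefixes each reply with a unix timestamp; the last one marks when the network went down
ping -D <host IP>

# terminal 2: note the time and trigger the reboot
date +%s; ssh core@<host IP> reboot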
Results:
i) SSH disconnects immediately; immediate shutdown of the network target and of the journal; ping stops nearly immediately; processes running in crio containers are stopped **after** the journal stopped (can be seen via IPMI) (1_1.png)
ii) SSH disconnects immediately; shutdown of crio takes a few minutes to complete; shutdown of the network target and of the journal; ping stops working after minutes
  a) shutdown takes less than 5 minutes, and some containers are still stopped _after_ the journal shut down
  b) shutdown takes the expected 5 minutes because the test-... pod's containers are blocking (2_b_1.png); even after this, we still see: Waiting for process (2_b_2.png)
The problem is that the node shutdown times look random at first. Sometimes the node takes north of 5 minutes to shut down, sometimes it's much faster. As stated above, after further testing I realized that this is tied to crun and depends on when the pod is started: if it's started during the system startup phase, the network is shut down nearly immediately after running the `reboot` command. However, with crun, when I start a pod with terminationGracePeriodSeconds some time after the node booted, the shutdown is delayed until my pod exits.
In the logs, look for the time between:
Jul 11 17:36:41 sno10.workload.bos2.lab systemd-logind[2242]: System is rebooting.
(...)
and the actual end of the log. The provided data is from the exact same system on 2 consecutive boots; the only difference is whether my pod is started automatically on boot or whether I start it via the API at some later point.
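To extract the shutdown duration from the journal of the previous boot, something like this works (the logind message is the one quoted above):
# timestamp at which the reboot was requested
journalctl -b -1 -o short-iso | grep "System is rebooting"
# timestamp of the very last journal entry of that boot
journalctl -b -1 -o short-iso -n 1
# the difference between the two is the effective shutdown time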