-
Story
-
Resolution: Done
-
Normal
There are 4 containers on our OpenShift master nodes that do not respond to a
SIGTERM when the node is scheduled to reboot during an upgrade. The host times
out 30 seconds after sending the SIGTERM and issues a SIGKILL, which terminates
the container so the node can continue its reboot.
This ticket is for the webhook container, with relevant logs below.
The ovndbchecker container has already been fixed with this commit as an
example.
❯ rg 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec journal
23278:Nov 17 11:21:27.147247 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope.
23279:Nov 17 11:21:27.235262 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
23281:Nov 17 11:21:27.380443 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.379000147Z" level=info msg="Created container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec: openshift-multus/multus-admission-controller-2tpkp/multus-admission-controller" id=d78917a3-5cd1-4914-ad02-27644c56e3b7 name=/runtime.v1alpha2.RuntimeService/CreateContainer
23282:Nov 17 11:21:27.380443 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.380086794Z" level=info msg="Starting container: 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec" id=24ac1a0b-1de0-472e-ab50-9a1bc8a63cb3 name=/runtime.v1alpha2.RuntimeService/StartContainer
23283:Nov 17 11:21:27.402583 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.402474821Z" level=info msg="Started container" PID=166795 containerID=9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec description=openshift-multus/multus-admission-controller-2tpkp/multus-admission-controller id=24ac1a0b-1de0-472e-ab50-9a1bc8a63cb3 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=ac1ec1c837d8712dea9c8918bfdbb6ce539b9c37f1d68c5f95a990bbf80d89f5
23296:Nov 17 11:21:27.989837 ci-op-n70c47rd-82914-54mxf-master-0 hyperkube[1925]: I1117 11:21:27.988636 1925 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-multus/multus-admission-controller-2tpkp" event=&{ID:839e8d29-06c4-41d2-b8bf-256438a50e4c Type:ContainerStarted Data:9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec}
29849:Nov 17 11:39:48.993369 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopping libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
30274:Nov 17 11:40:19.150464 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Stopping timed out. Killing.
30275:Nov 17 11:40:19.150964 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Killing process 166795 (webhook) with signal SIGKILL.
30282:Nov 17 11:40:19.162983 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Succeeded.
30283:Nov 17 11:40:19.164027 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Consumed 54ms CPU time
30293:Nov 17 11:40:19.183735 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Failed with result 'timeout'.
30294:Nov 17 11:40:19.184879 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopped libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
30295:Nov 17 11:40:19.194430 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Consumed 2.328s CPU time
The above log came from this journal file, produced by this job.