- Story
- Resolution: Done
- Normal
- Improvement
- 5
- SDN Sprint 214, SDN Sprint 215, SDN Sprint 216
There are 4 containers on our OpenShift master nodes that do not respond to
SIGTERM when the node is scheduled to reboot during an upgrade. The host times
out 30 seconds after sending the SIGTERM and issues a SIGKILL, which
terminates the container so that the node can continue its reboot.
This ticket is for the openshift-contr container (the controller-manager
container of the openshift-controller-manager pod); the relevant logs are below.
The ovndbchecker container has already been fixed with this commit as an
example.
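For illustration only, this is roughly what "responding to SIGTERM" looks like for a long-running process. It is a minimal Python sketch, not the ovndbchecker commit and not the actual code of the affected component; the names ShutdownFlag and run_until_terminated are hypothetical.

```python
import signal
import time


class ShutdownFlag:
    """Records that SIGTERM arrived so the main loop can exit on its own."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Keep the handler tiny: only record that shutdown was requested.
        self.requested = True

    def install(self) -> None:
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle)


def run_until_terminated(flag: ShutdownFlag, poll_interval: float = 0.5) -> int:
    """Hypothetical main loop: work until SIGTERM is seen, then return 0."""
    while not flag.requested:
        time.sleep(poll_interval)  # placeholder for one unit of real work
    # Cleanup would go here; it has to finish well inside the 30 second
    # window described in this ticket, or the process is SIGKILLed anyway.
    return 0
```

A process built this way would call flag.install() at startup and exit with the return value of run_until_terminated(flag), so the stop completes before the timeout instead of ending in a SIGKILL.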
❯ rg 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf journal
19773:Nov 17 11:14:35.441260 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope.
19774:Nov 17 11:14:35.477752 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
19776:Nov 17 11:14:35.718581 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.718514970Z" level=info msg="Created container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf: openshift-controller-manager/controller-manager-ksbrh/controller-manager" id=01ad450a-28c9-4006-86f9-e958e9891a7b name=/runtime.v1alpha2.RuntimeService/CreateContainer
19777:Nov 17 11:14:35.719549 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.719503667Z" level=info msg="Starting container: 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf" id=ef4d9ac7-40cd-4f30-a5df-730fe61fc23d name=/runtime.v1alpha2.RuntimeService/StartContainer
19778:Nov 17 11:14:35.742594 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.742495085Z" level=info msg="Started container" PID=146548 containerID=2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf description=openshift-controller-manager/controller-manager-ksbrh/controller-manager id=ef4d9ac7-40cd-4f30-a5df-730fe61fc23d name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=1b90aa76ae76dbb70359070f7784c900e14a4db1c4a3cb545bf6bffa68b3a185
19779:Nov 17 11:14:35.830865 ci-op-n70c47rd-82914-54mxf-master-0 hyperkube[1925]: I1117 11:14:35.830822 1925 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-controller-manager/controller-manager-ksbrh" event=&{ID:7997c390-474b-4c7e-b4a0-2385e784da04 Type:ContainerStarted Data:2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf}
29858:Nov 17 11:39:48.998278 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopping libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
30276:Nov 17 11:40:19.151103 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Stopping timed out. Killing.
30277:Nov 17 11:40:19.151373 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Killing process 146548 (openshift-contr) with signal SIGKILL.
30291:Nov 17 11:40:19.182309 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Succeeded.
30292:Nov 17 11:40:19.183528 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Consumed 47ms CPU time
30296:Nov 17 11:40:19.194780 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Failed with result 'timeout'.
30297:Nov 17 11:40:19.195502 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopped libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
30298:Nov 17 11:40:19.205413 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Consumed 883ms CPU time
The above log came from this journal file, which was produced by this job.
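To check the 30 second figure against the log, the gap between the "Stopping libcontainer container" entry and the "Stopping timed out. Killing." entry can be computed from their timestamps. The sketch below is only a convenience calculation; the helper names are hypothetical, and the year 2021 is taken from the crio entries in the same log because the syslog-style prefix omits it.

```python
from datetime import datetime

# Timestamps copied from the journal lines above.
STOPPING = "Nov 17 11:39:48.998278"  # systemd: Stopping libcontainer container ...
KILLING = "Nov 17 11:40:19.151103"   # systemd: Stopping timed out. Killing.


def parse_journal_stamp(stamp: str, year: int = 2021) -> datetime:
    """Parse a 'Mon DD HH:MM:SS.ffffff' prefix; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def stop_to_kill_seconds(stopping: str, killing: str) -> float:
    """Seconds elapsed between the stop request and the forced kill."""
    delta = parse_journal_stamp(killing) - parse_journal_stamp(stopping)
    return delta.total_seconds()


gap = stop_to_kill_seconds(STOPPING, KILLING)
print(f"Stop request to SIGKILL: {gap:.3f}s")
```

For these two entries the gap comes out at just over 30 seconds, which matches the timeout described in this ticket.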
- clones SDN-2505 fix webhook container from ignoring SIGTERM on node reboot (Closed)