OpenShift SDN / SDN-2505

Fix webhook container ignoring SIGTERM on node reboot


    • Type: Story
    • Resolution: Done
    • Priority: Normal

      There are 4 containers on our OpenShift master nodes that do not respond to
      SIGTERM when the node is scheduled to reboot during an upgrade. The host
      times out 30 seconds after sending the SIGTERM and issues a SIGKILL, which
      terminates the container so that the node can continue its reboot.
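      The behavior a well-behaved container is expected to show can be sketched
      as follows. This is only an illustration of the general pattern (install a
      SIGTERM handler, stop work, exit before the stop timeout), written in
      Python for brevity. It is not the webhook's actual code or the fix that
      was applied; `stop_requested`, `install_handlers`, and `serve` are
      hypothetical names.

```python
import signal
import threading

# Set when a termination signal arrives; the main loop watches this flag.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the main loop does the actual exit."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGTERM (sent on stop) and SIGINT (Ctrl-C)."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    signal.signal(signal.SIGINT, handle_sigterm)


def serve(poll_interval=0.5):
    """Hypothetical main loop: work until a stop is requested, then return."""
    while not stop_requested.is_set():
        # Placeholder for real work (for example, serving webhook requests).
        stop_requested.wait(poll_interval)
    # Returning promptly lets the process exit well inside the stop timeout,
    # so the host never has to escalate to SIGKILL.
    return "shutdown complete"


if __name__ == "__main__":
    install_handlers()
    print(serve())
```

      A process that never installs such a handler, or installs one that does
      not lead to an exit, keeps running until the host escalates to SIGKILL.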

      This ticket is for the webhook container; the relevant logs are below.

      The ovndbchecker container has already been fixed; this commit can serve
      as an example.

      ❯ rg 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec journal
      23278:Nov 17 11:21:27.147247 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope.
      23279:Nov 17 11:21:27.235262 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
      23281:Nov 17 11:21:27.380443 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.379000147Z" level=info msg="Created container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec: openshift-multus/multus-admission-controller-2tpkp/multus-admission-controller" id=d78917a3-5cd1-4914-ad02-27644c56e3b7 name=/runtime.v1alpha2.RuntimeService/CreateContainer
      23282:Nov 17 11:21:27.380443 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.380086794Z" level=info msg="Starting container: 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec" id=24ac1a0b-1de0-472e-ab50-9a1bc8a63cb3 name=/runtime.v1alpha2.RuntimeService/StartContainer
      23283:Nov 17 11:21:27.402583 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:21:27.402474821Z" level=info msg="Started container" PID=166795 containerID=9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec description=openshift-multus/multus-admission-controller-2tpkp/multus-admission-controller id=24ac1a0b-1de0-472e-ab50-9a1bc8a63cb3 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=ac1ec1c837d8712dea9c8918bfdbb6ce539b9c37f1d68c5f95a990bbf80d89f5
      23296:Nov 17 11:21:27.989837 ci-op-n70c47rd-82914-54mxf-master-0 hyperkube[1925]: I1117 11:21:27.988636    1925 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-multus/multus-admission-controller-2tpkp" event=&{ID:839e8d29-06c4-41d2-b8bf-256438a50e4c Type:ContainerStarted Data:9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec}
      29849:Nov 17 11:39:48.993369 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopping libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
      30274:Nov 17 11:40:19.150464 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Stopping timed out. Killing.
      30275:Nov 17 11:40:19.150964 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Killing process 166795 (webhook) with signal SIGKILL.
      30282:Nov 17 11:40:19.162983 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Succeeded.
      30283:Nov 17 11:40:19.164027 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Consumed 54ms CPU time
      30293:Nov 17 11:40:19.183735 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Failed with result 'timeout'.
      30294:Nov 17 11:40:19.184879 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopped libcontainer container 9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.
      30295:Nov 17 11:40:19.194430 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec.scope: Consumed 2.328s CPU time
      

      The above log came from this journal file, which was produced by this job.
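      The delay between the stop request and the kill can be checked directly
      from the journal lines. The script below is a minimal sketch: it assumes
      the short journal timestamp format seen in the log above, takes the year
      (2021, per the crio entries) as a parameter because the journal prefix
      does not include it, and uses hypothetical helper names (`parse_time`,
      `stop_to_kill_seconds`). The two sample lines are copied from the log.

```python
import re
from datetime import datetime

# Matches the short journal timestamp, e.g. "Nov 17 11:39:48.993369".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{6}")

CONTAINER_ID = "9a0605eafc2aa958adc1c71aa4908b02d328e523fabca4fc74be954e1506a0ec"

SAMPLE = [
    "29849:Nov 17 11:39:48.993369 ci-op-n70c47rd-82914-54mxf-master-0 "
    "systemd[1]: Stopping libcontainer container " + CONTAINER_ID + ".",
    "30274:Nov 17 11:40:19.150464 ci-op-n70c47rd-82914-54mxf-master-0 "
    "systemd[1]: crio-" + CONTAINER_ID + ".scope: Stopping timed out. Killing.",
]


def parse_time(line, year=2021):
    """Extract the journal timestamp from one line; the year is supplied."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp found in: " + line)
    return datetime.strptime(f"{year} {match.group(0)}", "%Y %b %d %H:%M:%S.%f")


def stop_to_kill_seconds(lines, container_id, year=2021):
    """Seconds between systemd's stop request and its timeout kill, or None."""
    stop_at = kill_at = None
    for line in lines:
        if container_id not in line:
            continue
        if "Stopping libcontainer container" in line:
            stop_at = parse_time(line, year)
        elif "Stopping timed out. Killing." in line:
            kill_at = parse_time(line, year)
    if stop_at is None or kill_at is None:
        return None
    return (kill_at - stop_at).total_seconds()


if __name__ == "__main__":
    print(round(stop_to_kill_seconds(SAMPLE, CONTAINER_ID), 3))  # 30.157
```

      For this container the gap is just over 30 seconds, which matches the
      timeout described at the top of the ticket.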

              nsimha@redhat.com Nikhil Simha (Inactive)
              jluhrsen Jamo Luhrsen