  1. OpenShift SDN
  2. SDN-2504

Fix agnhost container ignoring SIGTERM on node reboot


    • Type: Story
    • Resolution: Won't Do
    • Priority: Major
    • SDN Sprint 212, SDN Sprint 211, SDN Sprint 214

      There are 4 containers on our OpenShift master nodes that do not respond to
      SIGTERM when the node is scheduled to reboot during an upgrade. The host
      waits 30 seconds after sending the SIGTERM, times out, and issues a SIGKILL,
      which terminates the container so that the node can continue its reboot.
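
For illustration only, here is a minimal sketch of a long-running process that does react to SIGTERM, so that the host would not need to fall back to SIGKILL. It is not the agnhost code (agnhost is a Go binary); the names `install_sigterm_handler` and `serve` are hypothetical, and the polling loop stands in for whatever work the container really does.

```python
import signal
import threading


def install_sigterm_handler() -> threading.Event:
    """Register a SIGTERM handler and return an event that is set on receipt."""
    stop = threading.Event()

    def _on_sigterm(signum, frame):
        # Only record the request here; the main loop decides how to shut down.
        stop.set()

    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, _on_sigterm)
    return stop


def serve(stop: threading.Event, poll_seconds: float = 1.0) -> int:
    """Run until SIGTERM has been received, then return exit code 0."""
    while not stop.wait(timeout=poll_seconds):
        pass  # placeholder for the real work of the container
    return 0
```

An entrypoint could call `raise SystemExit(serve(install_sigterm_handler()))`. With a handler like this the process exits shortly after SIGTERM instead of waiting out the 30-second stop timeout.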

      This ticket is for the agnhost container; the relevant logs are below.

      The ovndbchecker container has already been fixed; this commit can serve as
      an example.

      ❯ rg 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25 journal
      13802:Nov 17 10:54:35.437143 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started crio-conmon-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope.
      13803:Nov 17 10:54:35.502725 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: run-runc-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25-runc.C48ZnV.mount: Succeeded.
      13804:Nov 17 10:54:35.507217 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started libcontainer container 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.
      13806:Nov 17 10:54:35.648448 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 10:54:35.648363913Z" level=info msg="Created container 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25: e2e-k8s-sig-apps-daemonset-upgrade-9253/ds1-l6j6f/ds1" id=12ce2a57-cc11-4bd7-bf54-c568f8422f3a name=/runtime.v1alpha2.RuntimeService/CreateContainer
      13807:Nov 17 10:54:35.652309 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 10:54:35.650009675Z" level=info msg="Starting container: 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25" id=0ef79ce7-a089-4872-a824-c88e7064a1d7 name=/runtime.v1alpha2.RuntimeService/StartContainer
      13808:Nov 17 10:54:35.723533 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 10:54:35.723453845Z" level=info msg="Started container" PID=92281 containerID=622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25 description=e2e-k8s-sig-apps-daemonset-upgrade-9253/ds1-l6j6f/ds1 id=0ef79ce7-a089-4872-a824-c88e7064a1d7 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=93b6249b837bba6b6479c71fae1f72fd1e4b69aba7c8ad5df7cf7eff9f2dd975
      13809:Nov 17 10:54:36.156269 ci-op-n70c47rd-82914-54mxf-master-0 hyperkube[1925]: I1117 10:54:36.156229    1925 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="e2e-k8s-sig-apps-daemonset-upgrade-9253/ds1-l6j6f" event=&{ID:b4bb63ec-1b81-4f37-a92b-1428f37da348 Type:ContainerStarted Data:622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25}
      29893:Nov 17 11:39:49.041298 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopping libcontainer container 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.
      30280:Nov 17 11:40:19.151721 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: *Stopping timed out. Killing*.
      30281:Nov 17 11:40:19.151940 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: Killing process 92281 (agnhost) with signal SIGKILL.
      30289:Nov 17 11:40:19.180974 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: Succeeded.
      30290:Nov 17 11:40:19.182091 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: Consumed 60ms CPU time
      30299:Nov 17 11:40:19.205993 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: Failed with result 'timeout'.
      30300:Nov 17 11:40:19.207206 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopped libcontainer container 622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.
      30301:Nov 17 11:40:19.218412 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: Consumed 195ms CPU time
      

      The above log came from this journal file, which was produced by this job.
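
To check the timeout from the log itself, the gap between the "Stopping" and "Killing process" entries can be computed from their journal timestamps. The sketch below assumes the `rg` line-number prefix shown above and takes the year (2021) from the crio timestamps in the same excerpt, since the journal prefix omits it; `journal_time` and `stop_to_kill_delay` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Optional "NNNNN:" prefix added by rg, followed by the journal timestamp.
_STAMP = re.compile(r"^\s*(?:\d+:)?(\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+)")


def journal_time(line: str, year: int = 2021) -> datetime:
    """Parse the leading journal timestamp of a log line."""
    match = _STAMP.match(line)
    if match is None:
        raise ValueError(f"no journal timestamp in: {line!r}")
    # The journal prefix has no year, so it is supplied by the caller.
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S.%f")


def stop_to_kill_delay(stop_line: str, kill_line: str) -> timedelta:
    """Time between the stop request and the SIGKILL entry."""
    return journal_time(kill_line) - journal_time(stop_line)


# The two relevant entries, copied from the excerpt above.
stop_line = (
    "29893:Nov 17 11:39:49.041298 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: "
    "Stopping libcontainer container "
    "622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25."
)
kill_line = (
    "30281:Nov 17 11:40:19.151940 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: "
    "crio-622f9a946738f495201087c8ebf0f03ff704219d58c1a17db2a5e7673bcb7a25.scope: "
    "Killing process 92281 (agnhost) with signal SIGKILL."
)
```

For these two entries `stop_to_kill_delay(stop_line, kill_line).total_seconds()` gives 30.110642, which matches the 30-second timeout described in the ticket.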

              mkennell@redhat.com Martin Kennelly
              jluhrsen Jamo Luhrsen
