  1. OpenShift SDN
  2. SDN-2506

Fix openshift-contr container ignoring SIGTERM on node reboot


    • Type: Story
    • Resolution: Done
    • Priority: Normal
    • openshift-4.11
    • SDN Sprint 214, SDN Sprint 215, SDN Sprint 216

      Four containers on our OpenShift master nodes do not respond to SIGTERM when
      the node is scheduled to reboot during an upgrade. About 30 seconds after the
      SIGTERM, the host's stop timeout expires and it issues a SIGKILL, which
      terminates the container so that the node can continue its reboot.

      This ticket covers the openshift-contr container; the relevant logs are below.

      The ovndbchecker container has already been fixed with this commit, which can
      serve as an example.

      ❯ rg 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf journal
      19773:Nov 17 11:14:35.441260 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope.
      19774:Nov 17 11:14:35.477752 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Started libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
      19776:Nov 17 11:14:35.718581 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.718514970Z" level=info msg="Created container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf: openshift-controller-manager/controller-manager-ksbrh/controller-manager" id=01ad450a-28c9-4006-86f9-e958e9891a7b name=/runtime.v1alpha2.RuntimeService/CreateContainer
      19777:Nov 17 11:14:35.719549 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.719503667Z" level=info msg="Starting container: 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf" id=ef4d9ac7-40cd-4f30-a5df-730fe61fc23d name=/runtime.v1alpha2.RuntimeService/StartContainer
      19778:Nov 17 11:14:35.742594 ci-op-n70c47rd-82914-54mxf-master-0 crio[1900]: time="2021-11-17 11:14:35.742495085Z" level=info msg="Started container" PID=146548 containerID=2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf description=openshift-controller-manager/controller-manager-ksbrh/controller-manager id=ef4d9ac7-40cd-4f30-a5df-730fe61fc23d name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=1b90aa76ae76dbb70359070f7784c900e14a4db1c4a3cb545bf6bffa68b3a185
      19779:Nov 17 11:14:35.830865 ci-op-n70c47rd-82914-54mxf-master-0 hyperkube[1925]: I1117 11:14:35.830822    1925 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-controller-manager/controller-manager-ksbrh" event=&{ID:7997c390-474b-4c7e-b4a0-2385e784da04 Type:ContainerStarted Data:2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf}
      29858:Nov 17 11:39:48.998278 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopping libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
      30276:Nov 17 11:40:19.151103 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Stopping timed out. Killing.
      30277:Nov 17 11:40:19.151373 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Killing process 146548 (openshift-contr) with signal SIGKILL.
      30291:Nov 17 11:40:19.182309 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Succeeded.
      30292:Nov 17 11:40:19.183528 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-conmon-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Consumed 47ms CPU time
      30296:Nov 17 11:40:19.194780 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Failed with result 'timeout'.
      30297:Nov 17 11:40:19.195502 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: Stopped libcontainer container 2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.
      30298:Nov 17 11:40:19.205413 ci-op-n70c47rd-82914-54mxf-master-0 systemd[1]: crio-2dd608986a30264cafe5a0bb76dc60900e088f28d20e2c3faa847638deef07bf.scope: Consumed 883ms CPU time
      

      The log above comes from this journal file, which was produced by this job.

            jluhrsen Jamo Luhrsen