OpenShift Bugs / OCPBUGS-58178

Node reboots in 4.20 taking twice as long to complete due to promtail needing to be SIGKILL'ed

    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Severity: Critical
    • Affects Version: 4.20
    • Component: Test Framework
    • Category: Quality / Stability / Reliability

      Description of problem:

      TLDR: seems to be https://issues.redhat.com/browse/OCPBUGS-13020 again.

      The MCO's blocking pre-submit job, e2e-gcp-op, is timing out on 4.20. Looking through some example failures, we see a drastic increase in reboot time.
      
      
      Looking at older runs, the maximum time between MCD restarts is about 1.5 minutes:
      
      2025-06-12T15:27:40.782632915+00:00 stderr F I0612 15:27:40.762518    2602 daemon.go:1420] Shutting down MachineConfigDaemon
      ...
      ...
      2025-06-12T15:29:00.179980191+00:00 stderr F I0612 15:29:00.179911    2607 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-3497-gc4676685-dirty (c467668557d7ff4f323e3dfc119de326ed6237db)
      
      
      On the newer runs that time out, the maximum is about 4 minutes:
      2025-06-25T14:20:14.210451103+00:00 stderr F I0625 14:20:14.210429    2595 daemon.go:1420] Shutting down MachineConfigDaemon
      ....
      ....
      2025-06-25T14:24:00.376789443+00:00 stderr F I0625 14:24:00.376725    2620 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-3513-g2499d899-dirty (2499d8996a2ba9f24b55a927a116d68e6ec9680a)
      2025-06-25T14:24:00.384100575+00:00 stderr F I0625 14:24:00.384004    2620 update.go:2594] Running: mount --rbind /run/secrets /rootfs/run/secrets

      So we're losing about 3 minutes somewhere after the MCD shuts down. Looking at the journal on this node in that window:
      Jun 25 14:20:14.087396 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-infra-3d9079bedd38a818d4f3247b72e5b14d.
      Jun 25 14:20:14.092601 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq root[6130]: machine-config-daemon[2595]: "reboot successful"
      Jun 25 14:20:14.104645 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd-logind[952]: The system will reboot now!
      Jun 25 14:20:14.108018 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd-logind[952]: System is rebooting.
      ...
      ...
      Jun 25 14:20:34.173082 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-conmon-b9bb524471736b4337196b2121d482dbe140c1a8bf388460ddefd9e4b2fe9bca.scope: Deactivated successfully.
      =======[BLIP OF ~3 MINS here]========================
      Jun 25 14:23:14.280408 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Stopping timed out. Killing.
      Jun 25 14:23:14.281032 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4126 (promtail) with signal SIGKILL.
      Jun 25 14:23:14.281273 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4132 (promtail) with signal SIGKILL.
      Jun 25 14:23:14.281305 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4152 (n/a) with signal SIGKILL.
      Jun 25 14:23:14.281326 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4153 (n/a) with signal SIGKILL.
      Jun 25 14:23:14.281352 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4163 (promtail) with signal SIGKILL.
      Jun 25 14:23:14.281366 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4164 (n/a) with signal SIGKILL.
      Jun 25 14:23:14.281377 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4255 (n/a) with signal SIGKILL.
      Jun 25 14:23:14.289820 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Failed with result 'timeout'.
      Jun 25 14:23:14.290292 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: Stopped libcrun container.
      
      
      The referenced scope is loki-promtail:
      
      Jun 25 14:18:56.330012 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq crio[2425]: time="2025-06-25T14:18:56.329927842Z" level=info msg="Created container 2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8: openshift-e2e-loki/loki-promtail-qppd5/promtail" id=914eebbb-c8e5-41ed-908e-2da5731e9712 name=/runtime.v1.RuntimeService/CreateContainer
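
      To see which units hit the stop timeout across a whole run, the journal can be scanned mechanically for systemd's "Stopping timed out. Killing." message. A small sketch (the helper name is ours, not part of any tooling):

```python
import re

# systemd logs "<unit>: Stopping timed out. Killing." when a unit ignores
# SIGTERM past its stop timeout. Scanning the journal for that message
# identifies every crio scope that had to be SIGKILL'ed.
TIMEOUT_RE = re.compile(r"(crio-[0-9a-f]+\.scope): Stopping timed out\. Killing\.")

def scopes_that_timed_out(journal_lines):
    return {m.group(1) for line in journal_lines
            if (m := TIMEOUT_RE.search(line))}

# One line taken verbatim from the journal excerpt above.
sample = [
    "Jun 25 14:23:14.280408 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: "
    "crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: "
    "Stopping timed out. Killing.",
]
print(scopes_that_timed_out(sample))
```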
      
      Example run journal logs:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/66378/rehearse-66378-pull-ci-openshift-machine-config-operator-main-e2e-gcp-op/1937821703648317440/artifacts/e2e-gcp-op/gather-extra/artifacts/nodes/ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq/
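
      The two reboot gaps quoted above can be checked directly from the MCD timestamps. A quick sketch (timestamps copied from the log excerpts; the nanosecond truncation is needed because Python's datetime only carries microseconds):

```python
import re
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # CRI-O logs use nanosecond precision; datetime only supports
    # microseconds, so truncate the extra digits before parsing.
    return datetime.fromisoformat(re.sub(r"(\.\d{6})\d+", r"\1", ts))

def gap_seconds(shutdown: str, restart: str) -> float:
    return (parse_ts(restart) - parse_ts(shutdown)).total_seconds()

# Timestamps copied from the MCD log excerpts above.
old = gap_seconds("2025-06-12T15:27:40.782632915+00:00",
                  "2025-06-12T15:29:00.179980191+00:00")
new = gap_seconds("2025-06-25T14:20:14.210451103+00:00",
                  "2025-06-25T14:24:00.376789443+00:00")
print(f"old run: {old:.0f}s, new run: {new:.0f}s")
# → old run: 79s, new run: 226s (~1m19s vs ~3m46s)
```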

      Version-Release number of selected component (if applicable):

          4.20

      How reproducible:

          Almost always (but some reboots seem to get lucky)

      Steps to Reproduce:

          1. Check CI logs

      Actual results:

          loki-promtail fails to stop on SIGTERM; systemd SIGKILLs it only after a ~3-minute stop timeout, roughly doubling node reboot time

      Expected results:

          loki-promtail should either be shut down cleanly or not be running
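
      For reference, the expected behavior is the standard systemd stop contract: handle SIGTERM and exit before the unit's stop timeout, or systemd escalates to SIGKILL as seen in the journal above. An illustrative sketch in Python (promtail itself is written in Go; the names below are hypothetical, not promtail's actual code):

```python
import signal
import sys
import threading

# Hypothetical log-shipper main loop: exit promptly on SIGTERM so systemd
# never has to escalate to SIGKILL after its stop timeout.
shutdown = threading.Event()

def handle_sigterm(signum, frame):
    shutdown.set()  # request a clean exit from the main loop

signal.signal(signal.SIGTERM, handle_sigterm)

def main_loop():
    while not shutdown.is_set():
        # ... tail files and push log batches here ...
        shutdown.wait(timeout=1.0)  # wake promptly when SIGTERM arrives
    # Clean shutdown path: flush pending batches, close connections.
    sys.exit(0)
```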

      Additional info:

          

              Assignee: Devan Goodwin (rhn-engineering-dgoodwin)
              Reporter: Yu Qi Zhang (jerzhang@redhat.com)