-
Bug
-
Resolution: Done
-
Undefined
-
None
-
4.20
-
None
-
Quality / Stability / Reliability
-
False
-
-
None
-
Critical
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
TLDR: seems to be https://issues.redhat.com/browse/OCPBUGS-13020 again.
MCO blocking pre-submit job e2e-gcp-op is timing out on 4.20. Looking through some example failures, we see a drastic increase in reboot time. Looking at older runs, the max amount of time between MCD restarts are about 1.5 minutes. 2025-06-12T15:27:40.782632915+00:00 stderr F I0612 15:27:40.762518 2602 daemon.go:1420] Shutting down MachineConfigDaemon ... ... 2025-06-12T15:29:00.179980191+00:00 stderr F I0612 15:29:00.179911 2607 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-3497-gc4676685-dirty (c467668557d7ff4f323e3dfc119de326ed6237db) On the newer runs that timeout, the max amount of time is about 4 minutes: 2025-06-25T14:20:14.210451103+00:00 stderr F I0625 14:20:14.210429 2595 daemon.go:1420] Shutting down MachineConfigDaemon .... .... 2025-06-25T14:24:00.376789443+00:00 stderr F I0625 14:24:00.376725 2620 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-3513-g2499d899-dirty (2499d8996a2ba9f24b55a927a116d68e6ec9680a) 2025-06-25T14:24:00.384100575+00:00 stderr F I0625 14:24:00.384004 2620 update.go:2594] Running: mount --rbind /run/secrets /rootfs/run/secretsSo we're losing 3 minutes somewhere after the MCD shuts down. When we look at the journals on this node at this window: Jun 25 14:20:14.087396 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-infra-3d9079bedd38a818d4f3247b72e5b14d. Jun 25 14:20:14.092601 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq root[6130]: machine-config-daemon[2595]: "reboot successful" Jun 25 14:20:14.104645 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd-logind[952]: The system will reboot now! Jun 25 14:20:14.108018 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd-logind[952]: System is rebooting. ... ... Jun 25 14:20:34.173082 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-conmon-b9bb524471736b4337196b2121d482dbe140c1a8bf388460ddefd9e4b2fe9bca.scope: Deactivated successfully. =======[BLIP OF ~3 MINS here]======================== Jun 25 14:23:14.280408 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Stopping timed out. Killing. Jun 25 14:23:14.281032 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4126 (promtail) with signal SIGKILL. Jun 25 14:23:14.281273 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4132 (promtail) with signal SIGKILL. Jun 25 14:23:14.281305 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4152 (n/a) with signal SIGKILL. Jun 25 14:23:14.281326 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4153 (n/a) with signal SIGKILL. Jun 25 14:23:14.281352 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4163 (promtail) with signal SIGKILL. Jun 25 14:23:14.281366 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4164 (n/a) with signal SIGKILL. Jun 25 14:23:14.281377 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Killing process 4255 (n/a) with signal SIGKILL. Jun 25 14:23:14.289820 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: crio-2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8.scope: Failed with result 'timeout'. Jun 25 14:23:14.290292 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq systemd[1]: Stopped libcrun container. The referenced scope is loki-promtail: Jun 25 14:18:56.330012 ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq crio[2425]: time="2025-06-25T14:18:56.329927842Z" level=info msg="Created container 2de1d0b8f3eeb7a4fb3b0ea6149320fb970366b653057d3f0c426cceedda9cd8: openshift-e2e-loki/loki-promtail-qppd5/promtail" id=914eebbb-c8e5-41ed-908e-2da5731e9712 name=/runtime.v1.RuntimeService/CreateContainer Example run journal logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/66378/rehearse-66378-pull-ci-openshift-machine-config-operator-main-e2e-gcp-op/1937821703648317440/artifacts/e2e-gcp-op/gather-extra/artifacts/nodes/ci-op-gwhnctd5-27524-j9qk8-worker-c-k6jsq/
Version-Release number of selected component (if applicable):
4.20
How reproducible:
Almost always (but some reboots seem to get lucky)
Steps to Reproduce:
1. Check CI logs
2.
3.
Actual results:
loki-promtail fails
Expected results:
loki-promtail should either be shutdown cleanly or not be running
Additional info: