Type: Bug
Resolution: Done
Priority: Critical
Version: 4.14
Quality / Stability / Reliability
Description of problem:
In MCO CI, our e2e-gcp-op job has been blocked for a week or so:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

Looking into the individual tests, every test is taking 2x to 2.5x as long to run.

Successful runs from before:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3441/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1649034037110509568/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3598/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648916567469068288/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648771501777752064/artifacts/e2e-gcp-op/test/build-log.txt

Failing runs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3663/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652001767627427840/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3505/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1651718542346686464/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3676/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652000289886048256/artifacts/e2e-gcp-op/test/build-log.txt

Note that each test, even when it passes, now takes 2x to 2.5x as long, and as a result we hit the global timeout of 2h30m.

I brought up a cluster on the latest successful nightly, 4.14.0-0.nightly-2023-05-01-124309, and observed a test run. During node shutdown, the promtail container in the openshift-e2e-loki/loki-promtail-xxx pod has to be SIGKILL'ed, which extends each reboot from ~2 minutes to ~4 minutes. I hope Test Framework is the right component for this pod?

Log snippet from journalctl:

May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Stopping timed out. Killing.
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Killing process 6196 (promtail) with signal SIGKILL.
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb conmon[6109]: conmon add09cca3149cd8d19fe <ninfo>: container 6196 exited with status 137
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Failed with result 'timeout'.

I captured a sosreport on a node after the reboot:
https://drive.google.com/file/d/1cu7s87EniEDbGosKfp1U5k2ztf2T_S4Y/view?usp=share_link
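For reference, a quick way to check this on a node (a sketch; it assumes oc debug access to the cluster, and the DaemonSet name loki-promtail in openshift-e2e-loki is an assumption based on the pod name, so adjust if it differs):

# Look at the previous boot's journal for the stop-timeout / SIGKILL escalation
oc debug node/<node-name> -- chroot /host journalctl -b -1 | grep -iE 'promtail|stopping timed out|sigkill'

# Check how long the pod spec gives promtail to shut down cleanly
oc get daemonset loki-promtail -n openshift-e2e-loki \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'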
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Spin up a 4.14 nightly from the most recent week or so
2. Run a MachineConfig update (see the sketch below)
3. Watch the journal logs on a worker during the resulting reboot
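For step 2, any change that forces a rolling reboot works. A minimal sketch (the MachineConfig name and file path here are arbitrary placeholders, not what our e2e tests actually apply):

# Apply a trivial worker MachineConfig so the MCO reboots the worker pool
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-reboot-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/reboot-test
        mode: 0644
        contents:
          source: data:,reboot-test
EOF

# Watch the worker pool roll, then inspect the journal on a rebooted node
oc get mcp worker -w
oc debug node/<node-name> -- chroot /host journalctl -b -1 | grep -iE 'promtail|sigkill'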
Actual results:
promtail has to be SIGKILL'ed during node shutdown, extending each reboot from ~2 minutes to ~4 minutes and pushing e2e-gcp-op past its 2h30m global timeout
Expected results:
promtail terminates cleanly on SIGTERM within its grace period, without requiring a SIGKILL, so node reboots stay at ~2 minutes
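One way to verify this once fixed (a sketch, assuming oc access; the pod and node names are placeholders):

# Delete one promtail pod and confirm it exits on SIGTERM rather than being killed
oc get pods -n openshift-e2e-loki
oc delete pod <loki-promtail-pod> -n openshift-e2e-loki
# The node journal should show no stop-timeout / SIGKILL escalation for the container
oc debug node/<node-name> -- chroot /host journalctl --since "10 minutes ago" | grep -iE 'promtail|sigkill'

The same check applies to the previous boot's journal (journalctl -b -1) after a MachineConfig-driven reboot.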
Additional info:
Marked critical since this is blocking our CI. Please move this to the correct component if Test Framework isn't the right place.