Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.14
Component/s: Test Framework
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

In MCO CI, our e2e-gcp-op job has been blocked for about a week: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

Looking at individual runs, every test now takes 2x to 2.5x as long as it did before.
Successful runs from before:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3441/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1649034037110509568/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3598/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648916567469068288/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648771501777752064/artifacts/e2e-gcp-op/test/build-log.txt

Failing runs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3663/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652001767627427840/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3505/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1651718542346686464/artifacts/e2e-gcp-op/test/build-log.txt
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3676/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652000289886048256/artifacts/e2e-gcp-op/test/build-log.txt

Note that each test still passes but now takes 2x to 2.5x as long, so the job hits its global timeout of 2h30m.

I brought up a cluster on the latest successful nightly, 4.14.0-0.nightly-2023-05-01-124309, and observed a test run.

During node shutdown, the promtail container in the pod openshift-e2e-loki/loki-promtail-xxx does not stop before systemd's stop timeout and has to be SIGKILL'ed, which stretches the reboot from ~2 minutes to ~4 minutes. I hope Test Framework is the right component for this pod.

Log snippet from journalctl:
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Stopping timed out. Killing.
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Killing process 6196 (promtail) with signal SIGKILL.
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb conmon[6109]: conmon add09cca3149cd8d19fe <ninfo>: container 6196 exited with status 137
May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Failed with result 'timeout'.
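To check other nodes for the same symptom, the snippet above can be scanned mechanically. The following is a minimal sketch, assuming the journal has been saved as plain text in the same short format shown here; the function name `find_sigkills` and the regular expression are illustrative and not part of any existing tooling.

```python
import re

# Matches systemd lines shaped like the one in the snippet above:
#   "<timestamp> <host> systemd[1]: <unit>: Killing process <pid> (<name>) with signal SIGKILL."
# Assumption: journal output in the default short format, one entry per line.
SIGKILL_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) systemd\[1\]: "
    r"(?P<unit>\S+): Killing process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"with signal SIGKILL\.$"
)


def find_sigkills(lines):
    """Return (timestamp, host, process name, pid) for each SIGKILL line."""
    hits = []
    for line in lines:
        match = SIGKILL_RE.match(line.strip())
        if match:
            hits.append(
                (match["ts"], match["host"], match["name"], int(match["pid"]))
            )
    return hits
```

Passing the lines of a saved journal file to `find_sigkills` returns one tuple per process that systemd had to kill, which makes it easy to confirm whether promtail is the only offender on a given node.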

I captured a sosreport on a node after the reboot:
https://drive.google.com/file/d/1cu7s87EniEDbGosKfp1U5k2ztf2T_S4Y/view?usp=share_link

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Spin up a cluster on a 4.14 nightly from the past week or so
2. Run a MachineConfig update
3. Watch the journal logs on a node while it reboots

Actual results:

promtail is SIGKILL'ed

Expected results:

promtail does not require SIGKILL'ing to terminate

Additional info:

Marked critical since this is blocking our CI. Please move to the correct component if this isn't the right place.

Assignee: Devan Goodwin

Reporter: Yu Qi Zhang

Need Info From: None

Contributors: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 6

Created: 2023/05/02 6:08 PM

Updated: 2025/07/26 11:44 PM

Resolved: 2023/05/31 11:09 PM
