Type: Bug
Resolution: Can't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.13.0
Component/s: RHCOS
Labels:
- coreos-afterburn
- osintegration

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Moderate
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Sometimes, when we request an OCP cluster from OpenShift CI, we get a cluster in which one worker node/machine is unavailable. The machine's `status.phase` shows `Provisioned` and never reaches `Running`.
Looking at the AWS instance logs for that machine, I found the following error, which the other instances don't have:
```
Failed to start Afterburn (Metadata).
```
The impact is that our product (Advanced Cluster Security) fails to deploy on such a cluster because of the reduced capacity, and our tests fail.
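
To spot the affected machine quickly, a minimal sketch along these lines may help. It assumes the Machine list has been saved as JSON, for example with `oc get machines.machine.openshift.io -n openshift-machine-api -o json`, and reports every machine whose `status.phase` is not `Running`. The function name and the sample data are hypothetical and not taken from the failing jobs.

```python
from __future__ import annotations

from typing import Any


def machines_not_running(machine_list: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (name, phase) for every Machine whose status.phase is not "Running"."""
    stuck = []
    for item in machine_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # A Machine without a reported phase is treated as not running.
        phase = item.get("status", {}).get("phase", "<no phase>")
        if phase != "Running":
            stuck.append((name, phase))
    return stuck


# Hypothetical sample shaped like a Machine list; names are made up.
SAMPLE_MACHINE_LIST = {
    "items": [
        {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "worker-b"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "worker-c"}, "status": {"phase": "Provisioned"}},
    ]
}

if __name__ == "__main__":
    for name, phase in machines_not_running(SAMPLE_MACHINE_LIST):
        print(f"{name}: {phase}")
```

In a real run, the dictionary would come from `json.load` on the saved command output instead of the sample above.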

Version-Release number of selected component (if applicable):

OCP 4.13 rc 4

We're in the process of moving to a newer RC, but that's not completed yet.
Obviously, we'll switch jobs to the final released version once it's out.

How reproducible:

It does not happen every time, only occasionally. Below are links to the failed jobs I have collected so far; the actual number of failures is probably growing daily.

Steps to Reproduce:

Jobs that failed to provision worker nodes:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5944/pull-ci-stackrox-stackrox-master-ocp-4-13-ui-e2e-tests/1655097481312079872
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784
4. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656017314207764480
5. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656575225518624768

Jobs that failed to provision a master node:
6. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-operator-e2e-tests/1658383638955298816
7. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/6044/pull-ci-stackrox-stackrox-release-3.73-ocp-4-13-qa-e2e-tests/1658547785030438912

Actual results:

One worker machine never enters the `Running` state.

Expected results:

All machines enter the `Running` state.

Additional info:

Please see this Slack thread, where I described the issue and gave more context: https://redhat-internal.slack.com/archives/C999USB0D/p1684234252699759

Here are more things from the template.

OCP Version at Install Time: 4.13.0-rc.4
RHCOS Version at Install Time: not sure
OCP Version after Upgrade (if applicable): n/a (not an upgrade)
RHCOS Version after Upgrade (if applicable): n/a
Platform (AWS, Azure, bare metal, GCP, vSphere, etc.): AWS
Architecture (x86_64, ppc64le, s390x, etc.): x86_64

If you're having problems booting/installing RHCOS, please provide:

The AWS instance log of the failing machine is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784/artifacts/qa-e2e-tests/gather-aws-console/artifacts/i-00eb96658ef3fbc4c
Other related files may be nearby, in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784/artifacts/qa-e2e-tests/gather-aws-console/artifacts/ but frankly I don't know how OSCI lays out its files.
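
A rough way to check a saved console log for this kind of failure is sketched below. It assumes the log is plain text and that failed units appear on lines containing `Failed to start <unit>.`, as in the error quoted in the description. The function name and the sample log lines are hypothetical; only the Afterburn message itself comes from the failing instance.

```python
from __future__ import annotations

import re

# Captures the unit description that follows "Failed to start",
# dropping an optional trailing period and whitespace.
FAILED_UNIT = re.compile(r"Failed to start (?P<unit>.+?)\.?\s*$")


def failed_units(console_log: str) -> list[str]:
    """Return the descriptions of units reported as failed to start."""
    units = []
    for line in console_log.splitlines():
        match = FAILED_UNIT.search(line)
        if match:
            units.append(match.group("unit"))
    return units


# Hypothetical excerpt; the surrounding lines and prefixes are made up.
SAMPLE_CONSOLE_LOG = """\
[  OK  ] Started Some Other Unit.
[FAILED] Failed to start Afterburn (Metadata).
[  OK  ] Reached target Some Target.
"""

if __name__ == "__main__":
    print(failed_units(SAMPLE_CONSOLE_LOG))
```

Running the same function over the console logs of the healthy instances would confirm whether the Afterburn failure is unique to the machine stuck in `Provisioned`.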

If you're having problems post-upgrade, please provide: n/a

If you're having SELinux related issues, please provide: doesn't seem to apply

Please add anything else that might be useful, for example: not sure I have it

Assignee: Unassigned

Reporter: Misha Sugakov

Need Info From: None

Contributors: None

QA Contact: Michael Nguyen

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2023/05/16 6:51 PM

Updated: 2025/07/26 11:33 PM

Resolved: 2024/04/30 9:04 AM
