Details
Bug
Resolution: Unresolved
4.13.0
Moderate
Description
Description of problem:
Sometimes, when we get an OCP cluster from OpenShift CI, one worker node/machine in the cluster is not available. The Machine's `status.phase` shows `Provisioned` and never reaches `Running`. Looking at the AWS instance logs for that machine, I find the following error, which the other instances don't have:
```
Failed to start Afterburn (Metadata).
```
The impact is that our product (Advanced Cluster Security) fails to deploy on such a cluster because of the reduced capacity, and our tests fail.
Version-Release number of selected component (if applicable):
OCP 4.13.0-rc.4. We're in the process of moving to a newer RC, but that's not complete yet. We'll switch the jobs to the final released version once it's out.
How reproducible:
It does not happen every time, only occasionally. Below are links to the failed jobs I have collected so far; the actual number of failures is likely growing daily.
Steps to Reproduce:
Jobs failed provisioning worker nodes:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5944/pull-ci-stackrox-stackrox-master-ocp-4-13-ui-e2e-tests/1655097481312079872
3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784
4. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656017314207764480
5. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656575225518624768
Job failed to provision master node:
6. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-operator-e2e-tests/1658383638955298816
7. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/6044/pull-ci-stackrox-stackrox-release-3.73-ocp-4-13-qa-e2e-tests/1658547785030438912
Actual results:
One worker machine doesn't enter Running state.
Expected results:
All machines enter Running state.
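For anyone triaging a similar cluster, here is a minimal sketch of how the stuck machine can be spotted. It assumes the input is the JSON printed by `oc get machines -n openshift-machine-api -o json`; the function name `machines_not_running` is made up for illustration and is not part of any OpenShift tooling.

```python
import json
from typing import List, Tuple


def machines_not_running(machine_list_json: str) -> List[Tuple[str, str]]:
    """Return (name, phase) for every Machine whose status.phase is not Running.

    The input is assumed to be a Machine list in JSON form, i.e. an object
    with an "items" array whose entries carry metadata.name and status.phase.
    """
    items = json.loads(machine_list_json).get("items", [])
    stuck = []
    for item in items:
        name = item.get("metadata", {}).get("name", "<unknown>")
        # A Machine without a phase yet is reported as "<none>".
        phase = item.get("status", {}).get("phase", "<none>")
        if phase != "Running":
            stuck.append((name, phase))
    return stuck
```

On an affected cluster this would be expected to return a single entry with phase `Provisioned`; on a healthy cluster it returns an empty list.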
Additional info:
Please see this Slack thread, where I described the issue and gave more context: https://redhat-internal.slack.com/archives/C999USB0D/p1684234252699759
Here are more things from the template.
OCP Version at Install Time: 4.13.0-rc.4
RHCOS Version at Install Time: not sure
OCP Version after Upgrade (if applicable): n/a (not an upgrade)
RHCOS Version after Upgrade (if applicable): n/a
Platform (AWS, Azure, bare metal, GCP, vSphere, etc.): AWS
Architecture (x86_64, ppc64le, s390x, etc.): x86_64
If you're having problems booting/installing RHCOS, please provide:
- The AWS instance log for the failing machine is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784/artifacts/qa-e2e-tests/gather-aws-console/artifacts/i-00eb96658ef3fbc4c
- Other related artifacts may be nearby, in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784/artifacts/qa-e2e-tests/gather-aws-console/artifacts/ but frankly I don't know how OSCI lays out the files.
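To compare console logs across instances, something like the following sketch may help. It assumes each console log has been downloaded as plain text and that failed units appear as `Failed to start <unit>.` lines, as in the Afterburn error quoted in the description; the helper name `failed_units` is hypothetical.

```python
import re
from typing import List

# Matches "Failed to start <unit>." anywhere in a line and captures the
# unit description without the trailing period.
FAILED_UNIT = re.compile(r"Failed to start (?P<unit>.+?)\.?\s*$")


def failed_units(console_log: str) -> List[str]:
    """Return the unit descriptions reported as 'Failed to start' in a log."""
    units = []
    for line in console_log.splitlines():
        match = FAILED_UNIT.search(line)
        if match:
            units.append(match.group("unit"))
    return units
```

Running this over the log of the failing instance and the log of a healthy one, then comparing the two lists, should surface entries such as `Afterburn (Metadata)` that appear only on the failing machine.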
If you're having problems post-upgrade, please provide: n/a
If you're having SELinux related issues, please provide: doesn't seem to apply
Please add anything else that might be useful, for example: not sure I have it