OpenShift Bugs / OCPBUGS-13706

OCP 4.13 rc4 machine does not enter Running state due to Afterburn error


Details

    • Moderate
    • No
    • False

    Description

      Description of problem:

      Occasionally, when we request an OCP cluster from OpenShift CI, we get a cluster in which one worker node/machine is unavailable. The machine's `status.phase` stays at `Provisioned` and never reaches `Running`.
      The AWS instance logs for that machine contain the following error, which the other instances do not show:
      ```
      Failed to start Afterburn (Metadata).
      ```
      The impact is that our product (Advanced Cluster Security) cannot be deployed on such a cluster because of the reduced capacity, so our tests fail.
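
When triaging an affected instance, the console output can be scanned for units that systemd reported as failing to start. The following is only a minimal Python sketch under stated assumptions: the console output has already been saved as plain text, the helper name `find_failed_units` is hypothetical, and the sample log is illustrative (only the Afterburn line comes from this report).

```python
import re

# Matches systemd boot messages of the form "Failed to start <unit description>."
FAILED_START = re.compile(r"Failed to start (?P<unit>.+?)\.\s*$")


def find_failed_units(console_log: str) -> list[str]:
    """Return the descriptions of units that systemd reported as failing to start."""
    failed = []
    for line in console_log.splitlines():
        match = FAILED_START.search(line)
        if match:
            failed.append(match.group("unit"))
    return failed


# Hypothetical sample; only the Afterburn line is taken from the report.
sample_log = """\
[  OK  ] Reached target Basic System.
[FAILED] Failed to start Afterburn (Metadata).
"""

print(find_failed_units(sample_log))
```

Running the same scan over the logs of a healthy instance and an affected one makes it easy to see which failures are unique to the machine stuck in `Provisioned`.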

      Version-Release number of selected component (if applicable):

      OCP 4.13.0-rc.4
      
      We're in the process of moving to a newer RC, but that's not completed yet.
      Obviously, we'll switch jobs to the final released version once it's out.

      How reproducible:

      It does not happen every time, only occasionally. Below are links to the failed jobs I have collected so far; the actual number of failures is likely growing daily.

      Steps to Reproduce:

      Jobs that failed to provision worker nodes:
      1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1655991557515382784
      2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5944/pull-ci-stackrox-stackrox-master-ocp-4-13-ui-e2e-tests/1655097481312079872
      3. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656017314207764480
      4. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-qa-e2e-tests/1656575225518624768
      
      Jobs that failed to provision a master node:
      5. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/5975/pull-ci-stackrox-stackrox-release-4.0-ocp-4-13-operator-e2e-tests/1658383638955298816
      6. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/stackrox_stackrox/6044/pull-ci-stackrox-stackrox-release-3.73-ocp-4-13-qa-e2e-tests/1658547785030438912

      Actual results:

      One worker machine never enters the `Running` state.

      Expected results:

      All machines enter the `Running` state.
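
The expected result can be checked mechanically by listing every Machine whose `status.phase` is not `Running`. The following is a minimal Python sketch, not a definitive check: it assumes the Machine list was exported as JSON (for example with `oc get machines -n openshift-machine-api -o json`), the helper name `machines_not_running` is hypothetical, and the machine names in the sample are made up.

```python
import json


def machines_not_running(machine_list: dict) -> dict[str, str]:
    """Map machine name to phase for every Machine whose status.phase is not Running."""
    stuck = {}
    for item in machine_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # A Machine without a reported phase is treated as not running.
        phase = item.get("status", {}).get("phase", "<unknown>")
        if phase != "Running":
            stuck[name] = phase
    return stuck


# Hypothetical sample shaped like a Machine list; names are illustrative only.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "worker-b"}, "status": {"phase": "Provisioned"}},
    {"metadata": {"name": "master-0"}, "status": {"phase": "Running"}}
  ]
}
""")

print(machines_not_running(sample))
```

An empty result corresponds to the expected outcome; any entry identifies a machine whose instance logs are worth inspecting.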

      Additional info:

      Please see this Slack thread, where I describe the issue and give more context: https://redhat-internal.slack.com/archives/C999USB0D/p1684234252699759

      Additional details from the bug template:

      OCP Version at Install Time: 4.13.0-rc.4
      RHCOS Version at Install Time: not sure
      OCP Version after Upgrade (if applicable): n/a (not an upgrade)
      RHCOS Version after Upgrade (if applicable): n/a
      Platform (AWS, Azure, bare metal, GCP, vSphere, etc.): AWS
      Architecture (x86_64, ppc64le, s390x, etc.): x86_64

      If you're having problems booting/installing RHCOS, please provide:

      If you're having problems post-upgrade, please provide: n/a

      If you're having SELinux related issues, please provide: doesn't seem to apply

      Please add anything else that might be useful, for example: not sure I have it

          People

            Unassigned
            Misha Sugakov (msugakov@redhat.com)
            Michael Nguyen