OCPBUGS-10218 (OpenShift Bugs)

HyperShift in-place upgrade: Upgraded instance fails to start up after reboot


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • 4.14
    • None
    • Important
    • No
    • False

      Description of problem:

      After an in-place upgrade of an instance in a hosted cluster, the instance does not come up after the reboot initiated by the upgrade.

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Reproducible for the first node upgraded in a NodePool when running the HyperShift e2e tests.

      Steps to Reproduce:

      1. Set up a management cluster with the 4.14/main version of the HyperShift operator.
      2. Run the in-place node upgrade test:
      bin/test-e2e \
        -test.v \
        -test.timeout=2h10m \
        -test.run=TestInPlaceUpgradeNodePool \
        --e2e.aws-credentials-file=$HOME/.aws/credentials \
        --e2e.aws-region=us-east-2 \
        --e2e.aws-zones=us-east-2a \
        --e2e.pull-secret-file=$HOME/.pull-secret \
        --e2e.base-domain=www.mydomain.com \
        --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.14.0-0.ci-2023-03-13-110647" \
        --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.14.0-0.ci-2023-03-08-170640" \
        --e2e.skip-api-budget \
        --e2e.aws-endpoint-access=PublicAndPrivate
      
      

      Actual results:

      The test fails while waiting for a node to finish upgrading.

      Expected results:

      The test succeeds.

      Additional info:

      The upgrade starts successfully, and the machine-config-daemon (MCD) pod runs to completion up to the point where it restarts the instance. The instance does not come back up after the restart.
      
      If the stuck upgrade is cleared by deleting the node and its corresponding EC2 instance, the upgrade then succeeds on the other node of the NodePool (assuming a NodePool with 2 replicas).
      
      Attached are the instance log captured after the restart and the MCD pod log.
      
      Example of a failed run:
      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_hypershift/2271/pull-ci-openshift-hypershift-main-e2e-aws/1634291913915895808

        1. instance.log
          64 kB
        2. mcd-pod.log
          102 kB

            Martin Kennelly (mkennell@redhat.com)
            Cesar Wong (cewong@redhat.com)
            Rio Liu
            Votes: 0
            Watchers: 12
