OpenShift Bugs / OCPBUGS-14962

hypershift operator stuck in provisioning loop


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • 4.13, 4.12, 4.14
    • HyperShift
    • Quality / Stability / Reliability

      Description of problem:

      Currently, the HyperShift operator (HO) does not recognize cases where an installation cannot complete.
      OCM reports the cluster below as ready:

      ID:            24bepknkvhgso8i51nugsj0bg8g1buah
      External ID:        a9fa0fc9-ce8f-49a7-9a16-a14afc938228
      Name:            qe-hp-63164-zli
      State:            ready
      API URL:        https://api.qe-hp-63164-zli.bqe3.p3.openshiftapps.com:443
      API Listening:        internal
      Console URL:
      Masters:        0
      Infra:            0
      Computes:        2
      Product:        rosa
      Provider:        aws
      Version:
      Region:            us-west-2
      Multi-az:        true
      CCS:            true
      Subnet IDs:        [subnet-09342d09aee84cf2b]
      PrivateLink:        true
      STS:            true
      Existing VPC:        true
      Channel Group:        stable
      Cluster Admin:        true
      Organization:        Red Hat1
      Creator:        rh-ee-zxiao
      Email:            zxiao@redhat.com
      AccountNumber:          5910538
      Created:        2023-06-14T03:58:13Z
      Expiration:        0001-01-01T00:00:00Z
      Management Cluster:     hs-mc-aspeu2bog
      Service Cluster:        hs-sc-aspeu1tig

      However, the HostedCluster progress is Partial:

       

      oc get hostedcluster -A |grep 24bepknkvhgso8i51nugsj0bg8g1buah
      ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah   qe-hp-63164-zli             qe-hp-63164-zli-admin-kubeconfig   Partial     True        False         The hosted control plane is available 

      A look at the HostedCluster shows that nodes are unable to join the cluster.

       

      However, HO keeps trying to spin up nodes indefinitely, deleting the machines roughly every 20 minutes and creating new ones:

       

      ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-jfvxx    24bepknkvhgso8i51nugsj0bg8g1buah                                                aws:///us-west-2a/i-02dc62b15b470dd15   Provisioned    16m     4.12.19
      ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-tf87d    24bepknkvhgso8i51nugsj0bg8g1buah                                                aws:///us-west-2a/i-0be1f3e4229bb29c8   Provisioned    15m     4.12.19 

      Our limited permissions in the customer's account prevented us from determining why the nodes would not join the cluster.

       

      However, in a similar case on a cluster we had access to, we found that deleting the hosted zone `cluster.hypershift.local` after it had been created produced this situation: the nodes could not contact the ignition server and never joined the cluster.

       

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      HO tries indefinitely to create nodes that cannot join the cluster. It never stops retrying and never surfaces an error.

      Expected results:

      • HO has a max retry for spinning up nodes
      • HO detects a cluster stuck in a provisioning loop
      • For detected cases, HO verifies the status of resources it has created and updates the hostedcluster status accordingly
      • Failed provisioning gets surfaced to OCM
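
      The expected behaviour above could be sketched roughly as follows. This is only an illustration of a retry budget plus loop detection, not HO code: the class names, the `max_attempts` and `window` defaults, and the `ProvisioningStuck` / `MaxRetriesExceeded` condition values are all hypothetical and do not correspond to existing HyperShift APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ProvisioningAttempt:
    """One machine that was created for the node pool (hypothetical record)."""
    machine_name: str
    created: datetime
    joined: bool  # True once the machine registered as a node


@dataclass
class LoopDetector:
    """Flags a node pool that keeps creating machines which never join."""
    max_attempts: int = 3                   # hypothetical retry budget
    window: timedelta = timedelta(hours=2)  # hypothetical observation window
    attempts: List[ProvisioningAttempt] = field(default_factory=list)

    def record(self, attempt: ProvisioningAttempt) -> None:
        self.attempts.append(attempt)

    def condition(self, now: datetime) -> Optional[dict]:
        """Return a status condition if the pool looks stuck, else None."""
        recent = [a for a in self.attempts if now - a.created <= self.window]
        if any(a.joined for a in recent):
            return None  # at least one node joined, so provisioning works
        if len(recent) < self.max_attempts:
            return None  # retry budget not yet exhausted
        # Hypothetical condition that could be copied to the hostedcluster
        # status and surfaced to OCM.
        return {
            "type": "ProvisioningStuck",
            "status": "True",
            "reason": "MaxRetriesExceeded",
            "message": f"{len(recent)} machines created within {self.window} "
                       "without any joining the cluster",
        }
```

      With machines replaced every 20 minutes, as in the listing above, three failed attempts would exhaust this budget after about an hour, at which point HO could stop retrying, verify the resources it created (for example the hosted zone), and report the failure.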

      Additional info:

              Unassigned
              Benson Ngoy (benson.ngoy)
              Jie Zhao
              Votes: 0
              Watchers: 4