Type: Bug
Resolution: Won't Do
Priority: Normal
Affects Versions: 4.12, 4.13, 4.14
Component: Quality / Stability / Reliability
Description of problem:
Currently, the HyperShift Operator (HO) does not recognize cases where an installation cannot succeed.
OCM reports the cluster below as ready:
ID:            24bepknkvhgso8i51nugsj0bg8g1buah
External ID:   a9fa0fc9-ce8f-49a7-9a16-a14afc938228
Name:          qe-hp-63164-zli
State:         ready
API URL:       https://api.qe-hp-63164-zli.bqe3.p3.openshiftapps.com:443
API Listening: internal
Console URL:
Masters:       0
Infra:         0
Computes:      2
Product:       rosa
Provider:      aws
Version:
Region:        us-west-2
Multi-az:      true
CCS:           true
Subnet IDs:    [subnet-09342d09aee84cf2b]
PrivateLink:   true
STS:           true
Existing VPC:  true
Channel Group: stable
Cluster Admin: true
Organization:  Red Hat1
Creator:       rh-ee-zxiao
Email:         zxiao@redhat.com
AccountNumber: 5910538
Created:       2023-06-14T03:58:13Z
Expiration:    0001-01-01T00:00:00Z
Management Cluster: hs-mc-aspeu2bog
Service Cluster:    hs-sc-aspeu1tig
The HostedCluster's Progress is Partial:
$ oc get hostedcluster -A | grep 24bepknkvhgso8i51nugsj0bg8g1buah
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah   qe-hp-63164-zli   qe-hp-63164-zli-admin-kubeconfig   Partial   True   False   The hosted control plane is available
A look at the HostedCluster shows that nodes are unable to join the cluster.
However, HO keeps retrying to spin up nodes indefinitely (deleting them every ~20 minutes and retrying):
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-jfvxx   24bepknkvhgso8i51nugsj0bg8g1buah   aws:///us-west-2a/i-02dc62b15b470dd15   Provisioned   16m   4.12.19
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-tf87d   24bepknkvhgso8i51nugsj0bg8g1buah   aws:///us-west-2a/i-0be1f3e4229bb29c8   Provisioned   15m   4.12.19
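The symptom above — Machines that reach Provisioned but never become Running — can be flagged mechanically. A minimal sketch, using the NAME / PHASE / AGE columns from the output above as sample input (the 10-minute threshold is an assumption, not a HyperShift default; on a live cluster the lines would come from `oc get machines` on the management cluster):

```shell
# Sample NAME / PHASE / AGE columns taken from the report above
machines='qe-hp-63164-zli-workers-864755945d-jfvxx Provisioned 16m
qe-hp-63164-zli-workers-864755945d-tf87d Provisioned 15m'

# Machines normally move Provisioned -> Running within a few minutes;
# anything still Provisioned past the threshold is likely stuck.
echo "$machines" | awk '
  $2 == "Provisioned" {
    age = $3; sub(/m$/, "", age)   # strip the "m" (minutes) suffix
    if (age + 0 > 10) print $1, "stuck in Provisioned for", $3
  }'
```

With the sample input, both Machines are flagged as stuck.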
The limited permissions in the CU's account prevented us from determining why the nodes would not join the cluster.
However, in a similar case on a cluster we did have access to, we found that deleting the HostedZone `cluster.hypershift.local` after it had been created produced the same situation: the nodes could not reach the ignition server and never joined the cluster.
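A sketch of the follow-up check from that similar case, assuming AWS CLI access to the cluster's account (which we did not have here). The `<cluster-name>.hypershift.local` zone-name pattern is taken from that case; an empty result would mean the HostedZone was deleted, so nodes cannot resolve the ignition endpoint:

```shell
# Hypothetical zone name for this cluster, following the pattern observed
# in the similar case; adjust to the actual cluster.
zone_name="qe-hp-63164-zli.hypershift.local"

# Look up the private HostedZone by name; prints its Id if it still exists.
check_zone() {
  aws route53 list-hosted-zones-by-name --dns-name "$1" \
    --query "HostedZones[?Name=='$1.'].Id" --output text
}

# check_zone "$zone_name"   # requires AWS credentials; not run here
```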
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
HO tries indefinitely to create nodes that cannot join the cluster; it never stops trying and never surfaces an error.
Expected results:
- HO enforces a maximum number of retries when spinning up nodes
- HO detects a cluster stuck in a provisioning loop
- For detected cases, HO verifies the status of the resources it has created and updates the HostedCluster status accordingly
- Failed provisioning is surfaced to OCM
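One possible shape for surfacing this on the HostedCluster — a sketch only; the condition type and reason names below are invented for illustration and are not existing HyperShift API fields:

```yaml
status:
  conditions:
  - type: NodeProvisioningStuck      # hypothetical condition name
    status: "True"
    reason: RetryLimitReached        # hypothetical reason
    message: Machines were repeatedly provisioned but never joined the cluster; stopped after the retry limit
```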
Additional info: