Type: Bug
Resolution: Won't Do
Priority: Normal
Affects Versions: 4.12, 4.13, 4.14
Component: Quality / Stability / Reliability
Description of problem:
Currently, the HyperShift Operator (HO) does not recognize cases where an installation cannot succeed.
OCM reports the cluster below as ready:
ID:            24bepknkvhgso8i51nugsj0bg8g1buah
External ID:   a9fa0fc9-ce8f-49a7-9a16-a14afc938228
Name:          qe-hp-63164-zli
State:         ready
API URL:       https://api.qe-hp-63164-zli.bqe3.p3.openshiftapps.com:443
API Listening: internal
Console URL:
Masters:       0
Infra:         0
Computes:      2
Product:       rosa
Provider:      aws
Version:
Region:        us-west-2
Multi-az:      true
CCS:           true
Subnet IDs:    [subnet-09342d09aee84cf2b]
PrivateLink:   true
STS:           true
Existing VPC:  true
Channel Group: stable
Cluster Admin: true
Organization:  Red Hat1
Creator:       rh-ee-zxiao
Email:         zxiao@redhat.com
AccountNumber: 5910538
Created:       2023-06-14T03:58:13Z
Expiration:    0001-01-01T00:00:00Z
Management Cluster: hs-mc-aspeu2bog
Service Cluster:    hs-sc-aspeu1tig
The HostedCluster's Progress is Partial:
$ oc get hostedcluster -A | grep 24bepknkvhgso8i51nugsj0bg8g1buah
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah   qe-hp-63164-zli   qe-hp-63164-zli-admin-kubeconfig   Partial   True   False   The hosted control plane is available
A look at the HostedCluster shows that nodes are unable to join the cluster.
However, HO keeps retrying to spin up nodes indefinitely (deleting them every ~20 minutes and retrying):
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-jfvxx   24bepknkvhgso8i51nugsj0bg8g1buah   aws:///us-west-2a/i-02dc62b15b470dd15   Provisioned   16m   4.12.19
ocm-production-24bepknkvhgso8i51nugsj0bg8g1buah-qe-hp-63164-zli   qe-hp-63164-zli-workers-864755945d-tf87d   24bepknkvhgso8i51nugsj0bg8g1buah   aws:///us-west-2a/i-0be1f3e4229bb29c8   Provisioned   15m   4.12.19
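The symptom above — Machines that reach Provisioned but never become Running — can be flagged mechanically. A minimal sketch, using the NAME / PHASE / AGE columns from the output above as sample input (the 10-minute threshold is an assumption, not a HyperShift default; on a live cluster the lines would come from `oc get machines` on the management cluster):

```shell
# Sample NAME / PHASE / AGE columns taken from the report above
machines='qe-hp-63164-zli-workers-864755945d-jfvxx Provisioned 16m
qe-hp-63164-zli-workers-864755945d-tf87d Provisioned 15m'

# Machines normally move Provisioned -> Running within a few minutes;
# anything still Provisioned past the threshold is likely stuck.
echo "$machines" | awk '
  $2 == "Provisioned" {
    age = $3; sub(/m$/, "", age)   # strip the "m" (minutes) suffix
    if (age + 0 > 10) print $1, "stuck in Provisioned for", $3
  }'
```

With the sample input, both Machines are flagged as stuck.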
The limited permissions in the CU's account prevented us from determining why the nodes would not join the cluster.
However, in a similar case on a cluster we did have access to, we found that deleting the HostedZone `cluster.hypershift.local` after it had been created produced the same situation: the nodes could not reach the ignition server and never joined the cluster.
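A sketch of the follow-up check from that similar case, assuming AWS CLI access to the cluster's account (which we did not have here). The `<cluster-name>.hypershift.local` zone-name pattern is taken from that case; an empty result would mean the HostedZone was deleted, so nodes cannot resolve the ignition endpoint:

```shell
# Hypothetical zone name for this cluster, following the pattern observed
# in the similar case; adjust to the actual cluster.
zone_name="qe-hp-63164-zli.hypershift.local"

# Look up the private HostedZone by name; prints its Id if it still exists.
check_zone() {
  aws route53 list-hosted-zones-by-name --dns-name "$1" \
    --query "HostedZones[?Name=='$1.'].Id" --output text
}

# check_zone "$zone_name"   # requires AWS credentials; not run here
```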
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
HO tries indefinitely to create nodes that cannot join the cluster; it never stops trying and never surfaces an error.
Expected results:
- HO enforces a maximum number of retries when spinning up nodes
- HO detects a cluster stuck in a provisioning loop
- For detected cases, HO verifies the status of the resources it has created and updates the HostedCluster status accordingly
- Failed provisioning is surfaced to OCM
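One possible shape for surfacing this on the HostedCluster — a sketch only; the condition type and reason names below are invented for illustration and are not existing HyperShift API fields:

```yaml
status:
  conditions:
  - type: NodeProvisioningStuck      # hypothetical condition name
    status: "True"
    reason: RetryLimitReached        # hypothetical reason
    message: Machines were repeatedly provisioned but never joined the cluster; stopped after the retry limit
```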
Additional info: