Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14
Component/s: HyperShift
Labels:
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
Hypershift Sprint 257, Hypershift Sprint 258, Hypershift Sprint 259, Hypershift Sprint 260
sprint_count:
4

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

After the request-serving nodes for a customer's HCP cluster were resized, the component responsible for adding the {{hypershift.openshift.io/cluster=${HCP_NS}}} label (probably the hypershift-operator) took approximately 5 hours to label the *second* new node.

This prevented the kube-apiserver deployment running on the HCP's management cluster from scheduling both replicas, leaving a degraded control plane whose requests were all handled by a single kube-apiserver replica.
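
To confirm this symptom, the ready and desired replica counts can be compared. This is a minimal sketch, assuming the deployment is named {{kube-apiserver}} and lives in the {{${HCP_NS}}} namespace on the management cluster. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report whether every kube-apiserver replica for one HCP is ready.
# Assumption: the deployment is named "kube-apiserver" in the HCP namespace.

kas_replica_status() {
  hcp_ns="$1"
  # Prints "<ready>/<desired>"; the ready field is empty when no replica is ready.
  status=$(oc get deployment kube-apiserver -n "$hcp_ns" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}') || return 2
  ready="${status%%/*}"
  desired="${status##*/}"
  echo "kube-apiserver ready=${ready:-0} desired=${desired}"
  # Exit status 0 only when the ready count matches the desired count.
  [ "${ready:-0}" = "$desired" ]
}
```

Usage: {{kas_replica_status "$HCP_NS" || echo "control plane degraded"}}.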

Manually adding the label resulted in the machine being terminated, as only a specific component (again, most likely the hypershift-operator) is allowed to add this label. Neither "poking" the unlabelled node by adding dummy annotations to trigger a reconcile nor replacing the unlabelled node's machine caused the label to be added.

This issue did not occur when resizing the underlying instance for the *first* request-serving node; that node's label was added within a few minutes, as expected.

Version-Release number of selected component (if applicable):

HCP version: 4.14.21
MC version: 4.14.31

How reproducible:

Unknown, but likely highly reproducible

Steps to Reproduce:

    1. Resize an HCP cluster's request-serving machinesets on a management cluster
    2. Once the first machineset is resized, delete its currently running machine and remove the label from its corresponding node with {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}}. This first replacement node was observed to come up healthy and get re-labelled quickly
    3. After the first machine is replaced, resize the machineset corresponding to the HCP cluster's second request-serving node. Run the same {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}} command to remove its request-serving label. Note that the label is not added to the second resized machine in a timely manner
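
Steps 2 and 3 apply the same two commands to each old node. This is a minimal sketch, assuming the machines live in the {{openshift-machine-api}} namespace (not stated in this report). The function name and its arguments are placeholders. The machineset resize itself is left out because the provider-specific fields are not given here.

```shell
#!/usr/bin/env bash
# Sketch of the per-node part of steps 2 and 3.
# Assumption: machines live in openshift-machine-api; override MACHINE_NS if not.

MACHINE_NS="${MACHINE_NS:-openshift-machine-api}"
LABEL_KEY="hypershift.openshift.io/cluster"

# $1 = old machine name, $2 = old node name (both placeholders).
replace_request_serving_node() {
  # Start deleting the machine that still runs on the old instance size.
  oc delete machine "$1" -n "$MACHINE_NS" --wait=false
  # Remove the cluster label from the old node (same command as in the steps).
  oc label node "$2" "${LABEL_KEY}-"
}
```

Usage: {{replace_request_serving_node "$OLD_MACHINE" "$OLD_NODE"}}, run once per request-serving node after its machineset has been resized.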

Actual results:

The {{hypershift.openshift.io/cluster=${HCP_NS}}} label takes several hours (approximately 5 in the observed case) to get added to the second request-serving node

Expected results:

The relevant label is added to the request-serving node within a few minutes, to avoid a single point of failure in a client HCP cluster's control plane
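
To measure how long the label actually takes to appear, a polling loop along these lines could be used. This is a sketch only: the function names, the 600-second default timeout, and the 15-second default interval are all assumptions, not values taken from HyperShift.

```shell
#!/usr/bin/env bash
# Sketch: wait for hypershift.openshift.io/cluster=<HCP_NS> to appear on a node.

LABEL_KEY="hypershift.openshift.io/cluster"

# $1 = node name, $2 = HCP namespace. Succeeds if the node carries the label.
node_has_cluster_label() {
  oc get nodes -l "${LABEL_KEY}=$2" -o name | grep -qx "node/$1"
}

# $1 = node, $2 = HCP namespace, $3 = timeout (s), $4 = poll interval (s, > 0).
wait_for_cluster_label() {
  node="$1"; hcp_ns="$2"; timeout="${3:-600}"; interval="${4:-15}"
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    if node_has_cluster_label "$node" "$hcp_ns"; then
      echo "label present on ${node} after ~${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "label still missing on ${node} after ${timeout}s" >&2
  return 1
}
```

Usage: {{wait_for_cluster_label "$NEW_NODE" "$HCP_NS" 600 15}}. A non-zero exit status means the label did not appear within the timeout.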

Additional info:

Assignee: Cesar Wong

Reporter: Trevor Nierman

Need Info From: None

Contributors: None

QA Contact: Jie Zhao

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2024/07/17 6:35 PM

Updated: 2025/07/22 5:38 AM
