Bug
Resolution: Unresolved
Major
4.14
Quality / Stability / Reliability
Hypershift Sprint 257, Hypershift Sprint 258, Hypershift Sprint 259, Hypershift Sprint 260
Description of problem:
After resizing the request-serving nodes for a customer's HCP cluster, the component responsible for adding the {{hypershift.openshift.io/cluster=${HCP_NS}}} label (probably the hypershift-operator?) took approximately 5 hours to label the *second* new node. This prevented the kube-apiserver deployment running on the HCP's management cluster from scheduling both replicas, ultimately leaving a degraded control plane whose requests were handled by a single apiserver.
Manually adding the label resulted in machine termination, since only a specific component (again, most likely the hypershift-operator?) is allowed to add it. "Poking" the unlabelled node by adding dummy annotations to trigger a reconcile, and replacing the unlabelled node's machine, did not cause the label to be added.
This issue did not occur when resizing the underlying instance for the *first* request-serving node; that node's label was added within a few minutes, as expected.
Version-Release number of selected component (if applicable):
HCP version: 4.14.21; MC version: 4.14.31
How reproducible:
Unknown, but likely highly reproducible
Steps to Reproduce:
1. Resize an HCP cluster's request-serving machinesets on a management cluster.
2. Once the first machineset is resized, delete its currently-running machine and remove the label from its corresponding node with {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}}. This first replacement node was observed to come up healthy and get re-labelled quickly.
3. After the first machine is replaced, resize the machineset corresponding to the HCP cluster's second request-serving node, and run the same {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}} command to remove its request-serving label. Note that the label is not added to the second resized machine in a timely manner.
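The steps above can be sketched as a shell session. This is a hedged outline, not an exact transcript: the machineset, machine, and node names ($MACHINESET_1, $OLD_MACHINE_1, $OLD_NODE_1, etc.) are placeholders for the real request-serving objects on the management cluster, and the final polling loop is just one way to watch for the label reappearing.

```shell
# Step 1: resize the first request-serving machineset (the instance-type
# change happens in the machineset's providerSpec; shown as a generic edit).
oc -n openshift-machine-api edit machineset "$MACHINESET_1"

# Step 2: delete the old machine, then strip the request-serving label from
# its node (trailing "-" on the label key removes it).
oc -n openshift-machine-api delete machine "$OLD_MACHINE_1"
oc label node "$OLD_NODE_1" hypershift.openshift.io/cluster-

# Step 3: once the first replacement node is healthy and re-labelled, repeat
# for the second request-serving machineset and node.
oc -n openshift-machine-api edit machineset "$MACHINESET_2"
oc label node "$OLD_NODE_2" hypershift.openshift.io/cluster-

# Observe: poll until the second replacement node regains the label.
# Expected within minutes; observed to take ~5 hours on the second node.
until oc get node "$NEW_NODE_2" \
    -o jsonpath='{.metadata.labels.hypershift\.openshift\.io/cluster}' \
    | grep -q .; do
  sleep 30
done
```

The escaped dots in the jsonpath expression are required because the label key itself contains dots.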
Actual results:
The {{hypershift.openshift.io/cluster=${HCP_NS}}} label takes a while (several hours) to get added to the second request-serving node
Expected results:
The relevant label is added to the request-serving node within a few minutes, avoiding a single point of failure in a client HCP cluster's control plane
Additional info: