OpenShift Bugs / OCPBUGS-37216

Large delay adding label to new request-serving node


    • Bug
    • Resolution: Unresolved
    • Major
    • 4.14
    • HyperShift
    • Quality / Stability / Reliability
    • False
    • Hypershift Sprint 257, Hypershift Sprint 258, Hypershift Sprint 259, Hypershift Sprint 260
    • 4

      Description of problem:

      After resizing the request-serving nodes for a customer's HCP cluster, the component responsible for adding the {{hypershift.openshift.io/cluster=${HCP_NS}}} label (probably the hypershift-operator) took approximately 5 hours to label the *second* new node.
      
      This prevented the kube-apiserver deployment running on the HCP's management cluster from scheduling both replicas, leaving a degraded control plane whose requests were all handled by a single kube-apiserver replica.
      
      Manually adding the label resulted in machine termination, as only a specific component (again, most likely the hypershift-operator) is allowed to add this label. Neither "poking" the unlabelled node by adding dummy annotations to trigger a reconcile nor replacing the unlabelled node's machine caused the label to be added.
      
      This issue did not occur when resizing the underlying instance for the *first* request-serving node; that node's label was added within a few minutes, as expected.
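
The condition described above (a request-serving node that is up but still missing the cluster label) can be checked for with a small script. The following is a minimal sketch, not part of the original report. It assumes the node list was obtained with {{oc get nodes -o json}} and parsed into a dict, and it assumes request-serving nodes can be recognised by a marker label; the marker name used here is an assumption and may need adjusting for a given environment.

```python
# Label named in this report; set by the responsible component on request-serving nodes.
CLUSTER_LABEL = "hypershift.openshift.io/cluster"

# ASSUMPTION: marker label used to recognise request-serving nodes.
# This name is not taken from the report; change it to match your environment.
REQUEST_SERVING_LABEL = "hypershift.openshift.io/request-serving-component"


def unlabelled_request_serving_nodes(node_list, marker=REQUEST_SERVING_LABEL):
    """Return names of nodes that carry `marker` but lack CLUSTER_LABEL.

    `node_list` is the parsed JSON of a Kubernetes NodeList,
    for example the output of `oc get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        labels = meta.get("labels") or {}
        if marker in labels and CLUSTER_LABEL not in labels:
            names.append(meta.get("name", "<unknown>"))
    return names
```

A non-empty result while a kube-apiserver replica is pending would be consistent with the behaviour reported here, though it does not by itself identify which component failed to add the label.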
      

      Version-Release number of selected component (if applicable):

      HCP version: 4.14.21
      MC version: 4.14.31
          

      How reproducible:

      Unknown, but likely highly reproducible
          

      Steps to Reproduce:

          1. Resize an HCP cluster's request-serving machinesets on a management cluster.
          2. Once the first machineset is resized, delete its currently running machine and remove the label from its corresponding node with {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}}. This first replacement node was observed to come up healthy and get re-labelled quickly.
          3. After the first machine is replaced, resize the machineset corresponding to the HCP cluster's second request-serving node. Run the same {{oc label node $OLD_NODE hypershift.openshift.io/cluster-}} command to remove its request-serving label. Note that the label is not added to the second resized machine in a timely manner.
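
To put a number on "timely" when repeating these steps, the time until the label appears on the replacement node can be measured by polling. The following is a minimal sketch under stated assumptions: {{wait_for_label}} and {{oc_node_labels}} are hypothetical helper names introduced here, the 300-second default timeout is an arbitrary stand-in for "a few minutes", and {{oc}} is assumed to be logged in to the management cluster.

```python
import json
import subprocess
import time


def oc_node_labels(node_name):
    """Return the labels of one node by shelling out to `oc get node -o json`."""
    out = subprocess.run(
        ["oc", "get", "node", node_name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["metadata"].get("labels", {})


def wait_for_label(get_labels, label, timeout_s=300.0, interval_s=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_labels() until `label` is present.

    Returns the elapsed seconds when the label is first seen,
    or None if `timeout_s` passes without the label appearing.
    `clock` and `sleep` are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if label in get_labels():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

For example, {{wait_for_label(lambda: oc_node_labels(new_node), "hypershift.openshift.io/cluster")}} would return the labelling delay in seconds for a node name held in {{new_node}}, or {{None}} if the label has not appeared within the timeout.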
          

      Actual results:

      The {{hypershift.openshift.io/cluster=${HCP_NS}}} label takes several hours (approximately 5 in the observed case) to be added to the second request-serving node
          

      Expected results:

      The relevant label is added to the request-serving node within a few minutes, to avoid a single point of failure in a client HCP cluster's control plane
          

      Additional info:

      
          

              cewong@redhat.com Cesar Wong
              tnierman.openshift Trevor Nierman
              Jie Zhao