OpenShift Bugs / OCPBUGS-33987

2 HostedClusters got the same serving node allocation [non-dynamic as well as dynamic setup]


      Description of problem:

          While running the PerfScale test on staging sectors, the script creates 1 HostedCluster (HC) per minute to load a Management Cluster up to its maximum capacity (64 HCs). Two clusters tried to use the same serving node pair and got into a deadlock:
      # oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12 
      NAME                                        STATUS   ROLES    AGE   VERSION
      ip-10-0-4-127.us-east-2.compute.internal    Ready    worker   34m   v1.27.11+d8e449a
      ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   34m   v1.27.11+d8e449a
      
      Each node in the pair got assigned to a different cluster:
      # oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017 
      NAME                                       STATUS   ROLES    AGE   VERSION
      ip-10-0-4-127.us-east-2.compute.internal   Ready    worker   33m   v1.27.11+d8e449a
      
      # oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019
      NAME                                        STATUS   ROLES    AGE   VERSION
      ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   36m   v1.27.11+d8e449a
      
      Taints were missing on those nodes, so metrics-forwarder pods from other HostedClusters got scheduled onto the serving nodes.
      
      # oc get pods -A -o wide | grep ip-10-0-84-196.us-east-2.compute.internal 
      ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019   kube-apiserver-86d4866654-brfkb                                           5/5     Running                  0                40m     10.128.48.6      ip-10-0-84-196.us-east-2.compute.internal    <none>           <none>
      ocm-staging-2bcins06s2acm59sp85g4qd43g9hq42g-mukri-dysn-0020   metrics-forwarder-6d787d5874-69bv7                                        1/1     Running                  0                40m     10.128.48.7      ip-10-0-84-196.us-east-2.compute.internal    <none>           <none>
      
      and a few more.
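
      For reference, the missing taints can be confirmed directly on the affected node. This is a minimal check, assuming a correctly allocated serving node would carry a taint keyed the same as the hypershift.openshift.io/cluster label shown above:

      ## dump the taints on the node; on a dedicated serving node this should not be empty
      # oc get node ip-10-0-84-196.us-east-2.compute.internal -o jsonpath='{.spec.taints}{"\n"}'

      ## any pod here that belongs to an unrelated HC confirms the taint is absent
      # oc get pods -A -o wide --field-selector spec.nodeName=ip-10-0-84-196.us-east-2.compute.internal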

      Version-Release number of selected component (if applicable):

      MC Version 4.14.17
      HC version 4.15.10
      HO Version quay.io/acm-d/rhtap-hypershift-operator:c698d1da049c86c2cfb4c0f61ca052a0654e2fb9

      How reproducible:

      Not always

      Steps to Reproduce:

          1. Create an MC with prod config (non-dynamic serving nodes)
          2. Create HCs on it at a rate of 1 HCP per minute (a minimal sketch of the load loop follows below)
          3. Observe that some clusters stay stuck in the Installing state for more than 30 minutes
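
      A minimal sketch of the load loop, assuming the hypershift CLI is used to create the clusters directly; the actual PerfScale script provisions through OCM staging, and the names, credential paths, base domain, and replica count below are placeholders (flags also vary between hypershift releases):

      ## hypothetical load loop: 1 HC per minute up to 64 HCs
      for i in $(seq -w 1 64); do
        hypershift create cluster aws \
          --name "perf-hc-${i}" \
          --pull-secret ./pull-secret.json \
          --aws-creds ./aws-credentials \
          --base-domain example.devcluster.openshift.com \
          --node-pool-replicas 2
        sleep 60
      done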
          

      Actual results:

      Only one replica of the kube-apiserver pods was up; the second was stuck in Pending. The Machine API had scaled up both nodes in that machineset pair (serving-12), but only one node was assigned (labelled) to each cluster: the node from one zone (serving-12a) was assigned to one hosted cluster (0017), while the node from the other zone (serving-12b) was assigned to a different hosted cluster (0019).
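
      The split allocation and the pending replicas can be confirmed with, for example:

      ## show which HC each node of the serving-12 pair was labelled for
      # oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12 -L hypershift.openshift.io/cluster

      ## list kube-apiserver pods stuck in Pending across the HCP namespaces
      # oc get pods -A -o wide --field-selector=status.phase=Pending | grep kube-apiserver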

      Expected results:

      Both kube-apiserver replicas should land on nodes from the same machineset pair (serving-12), and those nodes should be tainted so that pods from other HostedClusters cannot be scheduled there.
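
      For illustration only, a sketch of the expected per-cluster label and taint on both nodes of the pair, assuming the taint key mirrors the hypershift.openshift.io/cluster label used above (in normal operation these are applied by the HyperShift scheduler, not by hand):

      ## label and taint both nodes of the pair for the same hosted cluster
      for node in ip-10-0-4-127.us-east-2.compute.internal ip-10-0-84-196.us-east-2.compute.internal; do
        oc label node "${node}" hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017 --overwrite
        oc adm taint node "${node}" hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017:NoSchedule --overwrite
      done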

      Additional info: Slack

          

        Attachments:
        1. ho.log (23.92 MB)
        2. kas_unknown.yaml (20 kB)
