OCPBUGS-8691

Operands running management side missing affinity, tolerations, node selector and priority rules compared to the operator

    • Important
    • No
    • Storage Sprint 233
    • 1
    • Rejected
    • False
    • N/A
    • Release Note Not Required

      Description of problem:

      In the HyperShift context:
      Operands managed by operators running in the hosted control plane namespace in the management cluster do not honour the affinity opinions described at:
      https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
      https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265
      
      These operands running management side should honour the same affinity, tolerations, node selector and priority rules as the operator.
      This could be done by looking at the operator deployment itself or at the HCP resource.
      
      aws-ebs-csi-driver-controller
      aws-ebs-csi-driver-operator
      csi-snapshot-controller
      csi-snapshot-webhook
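
      As a minimal sketch of the approach described above (propagating the operator's own scheduling constraints onto its operand Deployments), assuming the k8s.io/api types; the helper name, the example values and the priority class are hypothetical, not the code merged for this bug:

      package main

      import (
          "fmt"

          appsv1 "k8s.io/api/apps/v1"
          corev1 "k8s.io/api/core/v1"
      )

      // copySchedulingConstraints propagates the scheduling-related fields of the
      // operator's own pod template onto an operand Deployment so that operands
      // running management side follow the same placement rules as the operator.
      // Hypothetical helper, for illustration only.
      func copySchedulingConstraints(operator, operand *appsv1.Deployment) {
          src := operator.Spec.Template.Spec
          dst := &operand.Spec.Template.Spec

          dst.Affinity = src.Affinity.DeepCopy()
          dst.Tolerations = append([]corev1.Toleration(nil), src.Tolerations...)
          dst.NodeSelector = map[string]string{}
          for k, v := range src.NodeSelector {
              dst.NodeSelector[k] = v
          }
          dst.PriorityClassName = src.PriorityClassName
      }

      func main() {
          operator := &appsv1.Deployment{}
          operator.Spec.Template.Spec.NodeSelector = map[string]string{
              "kubernetes.io/hostname": "ip-10-0-153-163.ec2.internal",
          }
          // Priority class name is illustrative.
          operator.Spec.Template.Spec.PriorityClassName = "hypershift-control-plane"

          operand := &appsv1.Deployment{}
          copySchedulingConstraints(operator, operand)
          fmt.Println(operand.Spec.Template.Spec.NodeSelector, operand.Spec.Template.Spec.PriorityClassName)
      }

      Per the description, the same fields could equally be derived from the HCP resource rather than from the operator Deployment itself.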
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a hypershift cluster.
      2. Check the affinity rules and node selector of the operands listed above (see the sketch after these steps).
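
      For step 2, a hedged client-go sketch that dumps the scheduling fields of the operands in the hosted control plane namespace; the kubeconfig path and namespace below are placeholders:

      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Placeholders: management cluster kubeconfig and hosted control plane namespace.
          config, err := clientcmd.BuildConfigFromFlags("", "/path/to/management-kubeconfig")
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              panic(err)
          }

          operands := []string{
              "aws-ebs-csi-driver-controller",
              "aws-ebs-csi-driver-operator",
              "csi-snapshot-controller",
              "csi-snapshot-webhook",
          }
          for _, name := range operands {
              d, err := client.AppsV1().Deployments("clusters-mycluster").Get(context.TODO(), name, metav1.GetOptions{})
              if err != nil {
                  fmt.Println(name, "error:", err)
                  continue
              }
              spec := d.Spec.Template.Spec
              fmt.Printf("%s:\n  nodeSelector=%v\n  affinity=%v\n  tolerations=%v\n  priorityClassName=%s\n",
                  name, spec.NodeSelector, spec.Affinity, spec.Tolerations, spec.PriorityClassName)
          }
      }

      The same fields can also be read with oc get deployment -o yaml in the hosted control plane namespace.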
      

      Actual results:

      Operands are missing affinity rules and node selector.

      Expected results:

      Operands have the same affinity rules and node selector as the operator.

      Additional info:

       

            [OCPBUGS-8691] Operands running management side missing affinity, tolerations, node selector and priority rules compared to the operator

            GitLab CEE Bot added a comment - Ian Main mentioned this issue in a merge request of Service Delivery / app-interface on branch ibm_integration_bump:

            Bump IBM integration to our latest prod image.

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:5006


            Rohit Patil added a comment - Marking as Verified based on result of normal OCP cluster.

            Antoni Segura Puimedon added a comment - increasing the priority to blocker (not for OCP, but for ROSA)

            Jan Safranek added a comment - Some notes from testing:

            • Get an OCP cluster with all worker nodes in the same availability zone, e.g. get 3 replicas in us-east-1a and 0 in the others:
            $ oc -n openshift-machine-api scale machineset/jsafrane-1-vnqrz-worker-us-east-1a --replicas=3
            $ oc -n openshift-machine-api scale machineset/jsafrane-1-vnqrz-worker-us-east-1b --replicas=0
            $ oc -n openshift-machine-api scale machineset/jsafrane-1-vnqrz-worker-us-east-1c --replicas=0
            • Install HyperShift into it as usual, no special config needed. No special version needed either.
            • Install a guest cluster with the PR(s), i.e. the bug must be fixed there.
            • Edit the HostedCluster + add e.g. nodeSelector:
            $ oc -n clusters edit hostedcluster <your hosted cluster>
            
            ...
            spec:
              nodeSelector:
                kubernetes.io/hostname: ip-10-0-153-163.ec2.internal
            • See all hosted control plane pods getting re-created on the given node. AWS EBS CSI driver operator + driver + snapshot controller Pods should get re-created there too.
            $ oc -n clusters-jsafrane get pod -o wide
            NAME                                                  READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
            aws-ebs-csi-driver-controller-bfbdb85bc-g9z6s         7/7     Running   0          10m     10.129.2.63    ip-10-0-153-163.ec2.internal   <none>           <none>
            aws-ebs-csi-driver-operator-679cb46978-6vvfc          1/1     Running   0          10m     10.129.2.64    ip-10-0-153-163.ec2.internal   <none>           <none>
            cluster-storage-operator-9f5849847-cxwfv              1/1     Running   0          9m6s    10.129.2.95    ip-10-0-153-163.ec2.internal   <none>           <none>
            csi-snapshot-controller-857c664f5-zc9pz               1/1     Running   0          10m     10.129.2.65    ip-10-0-153-163.ec2.internal   <none>           <none>
            csi-snapshot-controller-operator-88b54f859-g4jt5      1/1     Running   0          9m6s    10.129.2.93    ip-10-0-153-163.ec2.internal   <none>           <none>
            csi-snapshot-webhook-6dbd87bbb4-ph6sj                 1/1     Running   0          10m     10.129.2.66    ip-10-0-153-163.ec2.internal   <none>           <none>
            
            • Similarly, label a random node with hypershift.openshift.io/cluster: <hosted control plane namespace> and clear the HostedCluster nodeSelector. All newly created pods should be scheduled on the labelled node. By removing nodeSelector from HostedCluster, all Pods in the hosted control plane will be re-created with empty nodeSelector and nodeAffinity should schedule them on the labelled node (if there is space for them there).
            $ oc label nodes <node name> hypershift.openshift.io/cluster=clusters-jsafrane
            $ oc -n clusters edit hostedcluster jsafrane
            # delete nodeSelector
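
            For reference, a rough Go sketch of the kind of preferred node affinity the re-created pods are expected to carry for the hypershift.openshift.io/cluster label mentioned above; the weight and exact term are illustrative, not copied from the HyperShift code:

            package main

            import (
              "fmt"

              corev1 "k8s.io/api/core/v1"
            )

            // expectedClusterAffinity sketches the preferred node affinity that pods in a
            // hosted control plane namespace are expected to carry, so that labelling a node
            // hypershift.openshift.io/cluster=<hcp namespace> makes the scheduler favour it.
            // The weight and exact matching term are illustrative.
            func expectedClusterAffinity(hcpNamespace string) *corev1.Affinity {
              return &corev1.Affinity{
                NodeAffinity: &corev1.NodeAffinity{
                  PreferredDuringSchedulingIgnoredDuringExecution: []corev1.PreferredSchedulingTerm{
                    {
                      Weight: 50,
                      Preference: corev1.NodeSelectorTerm{
                        MatchExpressions: []corev1.NodeSelectorRequirement{
                          {
                            Key:      "hypershift.openshift.io/cluster",
                            Operator: corev1.NodeSelectorOpIn,
                            Values:   []string{hcpNamespace},
                          },
                        },
                      },
                    },
                  },
                },
              }
            }

            func main() {
              fmt.Printf("%+v\n", expectedClusterAffinity("clusters-jsafrane"))
            }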
            
            

             


              rhn-engineering-jsafrane Jan Safranek
              agarcial@redhat.com Alberto Garcia Lamela
              Rohit Patil Rohit Patil