OpenShift Bugs: OCPBUGS-43357

Control plane pods missing tolerations specified in hypershift create cluster azure --tolerations

    • Bug
    • Resolution: Done-Errata
    • Normal
    • None
    • 4.18.0
    • HyperShift
    • Important
    • None
    • Hypershift Sprint 262, Hypershift Sprint 263
    • 2
    • Proposed
    • False

      Description of problem:

      Some control plane pods do not receive the tolerations specified with the --toleration flag of hypershift create cluster azure.

      Steps to Reproduce:

      1. Create Azure HC with hypershift create cluster azure --toleration key=foo-bar.baz/quux,operator=Exists --toleration=key=fred,operator=Equal,value=foo,effect=NoSchedule --toleration key=waldo,operator=Equal,value=bar,effect=NoExecute,tolerationSeconds=3600 ... 
      
      2. Run the following script against the MC
      
      NAMESPACE="clusters-XXX"
      PODS="$(oc get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')"
      
      for POD in $PODS; do
        echo "Checking pod: $POD"  
        tolerations="$(oc get po -n "$NAMESPACE" "$POD" -o jsonpath='{.spec.tolerations}' | jq -c --sort-keys .)"
        failed="false"
        
        if ! grep -q '"key":"foo-bar.baz/quux","operator":"Exists"' <<< "$tolerations"; then
          echo "No foo-bar.baz/quux key found" >&2
          failed="true"
        fi
        
        if ! grep -q '"effect":"NoSchedule","key":"fred","operator":"Equal","value":"foo"' <<< "$tolerations"; then
          echo "No fred key found" >&2
          failed="true"
        fi
        
        if ! grep -q '"effect":"NoExecute","key":"waldo","operator":"Equal","tolerationSeconds":3600,"value":"bar"' <<< "$tolerations"; then
          echo "No waldo key found" >&2
          failed="true"
        fi
        
        if [[ $failed == "true" ]]; then
          echo "Tolerations: "
          echo "$tolerations" | jq --sort-keys .
        fi
        echo 
      done 
      3. Take note of the results 
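
The grep patterns in the script depend on jq printing each toleration as compact JSON with alphabetically sorted keys. As a minimal sketch of that mapping, the helper below converts one --toleration value into the string the script expects. The function name toleration_to_json is hypothetical and this is not how the hypershift CLI parses the flag; it also assumes values contain no commas or double quotes.

```shell
# Hypothetical helper (not part of hypershift): turn one --toleration value
# such as "key=fred,operator=Equal,value=foo,effect=NoSchedule" into compact
# JSON with alphabetically sorted keys, the shape `jq -c --sort-keys` prints.
toleration_to_json() (
  # One name=value pair per line, sorted so field names come out in order.
  sorted=$(printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort)
  out=""
  while IFS= read -r pair; do
    [ -n "$pair" ] || continue
    name=${pair%%=*}
    value=${pair#*=}
    case "$name" in
      # tolerationSeconds is an integer in the pod spec; other fields are strings.
      tolerationSeconds) out="$out\"$name\":$value," ;;
      *)                 out="$out\"$name\":\"$value\"," ;;
    esac
  done <<EOF
$sorted
EOF
  # Drop the trailing comma and wrap the fields in braces.
  printf '{%s}\n' "${out%,}"
)

toleration_to_json 'key=waldo,operator=Equal,value=bar,effect=NoExecute,tolerationSeconds=3600'
```

The body runs in a subshell (parentheses rather than braces) so the working variables do not leak into the calling script.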
      

      Actual results (and dump files):

      https://drive.google.com/drive/folders/1MQYihLSaK_9WDq3b-H7vx-LheSX69d2O?usp=sharing

      Expected results:

      All specified tolerations are propagated to all control plane pods. 
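
One way to state this expectation as a pass/fail check is sketched below. It takes the compact, key-sorted tolerations JSON of a single pod plus the expected fragments, and reports any fragment that is absent. The function name missing_tolerations is hypothetical. Matching with grep -F treats the fragments literally, so the dot in a key such as foo-bar.baz/quux is not interpreted as a regular-expression wildcard.

```shell
# Hypothetical helper: print each expected fragment that is absent from the
# compact tolerations JSON passed as the first argument.
# Returns non-zero if at least one fragment is missing.
missing_tolerations() {
  tolerations=$1
  shift
  status=0
  for fragment in "$@"; do
    # -F matches the fragment as a fixed string rather than a regex.
    if ! printf '%s\n' "$tolerations" | grep -qF -- "$fragment"; then
      printf 'missing: %s\n' "$fragment"
      status=1
    fi
  done
  return "$status"
}
```

In the reproduction script, the first argument would be "$tolerations" and the remaining arguments the three fragments currently passed to grep.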
      


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.18.1 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:6122


            OpenShift Jira Bot added a comment -

            Hi sjenning,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.

            Seth Jennings added a comment -

            My previous comment was extracted from the dump.

            However, when I try to recreate with a 4.18 HC on AWS, the only pod that does not have the tolerations is aws-ebs-csi-driver-controller.

            It appears that the operands deployed by the CNO and CSO, as well as the csi-snapshot-controller, all have the correct tolerations.

            Only the CSI-specific driver operators were not updated to deploy their controllers with the tolerations. This will need to be done on a per-CSI driver operator basis.

            Seth Jennings added a comment -

            For the record, all the pods that lack tolerations are deployed as second-level operands in the HCP by the CSO and CNO.

            CSO deployed:
            azure-disk-csi-driver-controller
            azure-disk-csi-driver-operator
            azure-file-csi-driver-controller
            azure-file-csi-driver-operator
            csi-snapshot-controller
            csi-snapshot-webhook

            CNO deployed:
            multus-admission-controller
            cloud-network-config-controller
            network-node-identity
            ovnkube-control-plane
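
When triaging the output of the reproduction script, it can help to group failing pods by the operator that deploys them. The sketch below encodes the two lists from this comment as a lookup. The function name owning_operator is hypothetical, and it assumes the pods are Deployment-managed, so that each pod name is the workload name followed by two generated suffixes.

```shell
# Hypothetical helper: map a pod name to the operator that deploys it,
# using the CSO and CNO lists from the comment above.
# Assumes pod names look like <workload>-<replicaset-hash>-<pod-suffix>.
owning_operator() {
  workload=${1%-*}          # strip the pod suffix
  workload=${workload%-*}   # strip the ReplicaSet hash
  case "$workload" in
    azure-disk-csi-driver-controller|azure-disk-csi-driver-operator|\
    azure-file-csi-driver-controller|azure-file-csi-driver-operator|\
    csi-snapshot-controller|csi-snapshot-webhook)
      echo "CSO" ;;
    multus-admission-controller|cloud-network-config-controller|\
    network-node-identity|ovnkube-control-plane)
      echo "CNO" ;;
    *)
      echo "other" ;;
  esac
}
```

Matching on the exact workload name rather than a prefix keeps similarly named pods from being attributed to the wrong operator.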

            Antoni Segura Puimedon added a comment -

            Should have e2e coverage.

              sjenning Seth Jennings
              fxierh Feilian Xie (Inactive)
              He Liu He Liu