OpenShift Bugs / OCPBUGS-44740

Node topology labels are automatically injected by the kubelet


      Description of problem:

      The topology.cinder.csi.openstack.org/zone node labels are automatically injected by the kubelet during CSI driver node registration, so the zone label on a node cannot be changed manually to work around a volume node affinity conflict: the CSI node registrar detects a topology value collision and fails.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

      1. Created three separate machine sets, each corresponding to a specific zone: nova, nova-az2, and nova-az3.
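
      For reference, a minimal, incomplete sketch of one of these MachineSets, assuming the standard OpenStack Machine API provider spec (selectors and most provider details omitted; the name and replica count are illustrative, and the zone is selected via availabilityZone):
      =====
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      metadata:
        name: <infra-id>-worker-nova-az2   # illustrative name; one MachineSet per zone
        namespace: openshift-machine-api
      spec:
        replicas: 1
        template:
          spec:
            providerSpec:
              value:
                # Nova availability zone for machines created by this MachineSet
                availabilityZone: nova-az2
      =====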
      
      2. Recreated the worker nodes, ensuring an even distribution across the machine sets.
      
      3. We confirmed that node affinity was applied to the existing volumes, as shown below:
      =====
      nodeAffinity:
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.cinder.csi.openstack.org/zone
                  operator: In
                  values:
                    - nova
            - matchExpressions:
                - key: topology.cinder.csi.openstack.org/zone
                  operator: In
                  values:
                    - nova-az3
      =====
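
      The affinity above can be read directly from the PersistentVolume object; the PV name below is illustrative:
      =====
      # Inspect spec.nodeAffinity of the volume bound to the affected PVC
      $ oc get pv pvc-0123abcd-4567-89ab-cdef-0123456789ab -o yaml
      =====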
      
      4. When the scheduler attempts to place a pod that uses one of these volumes on a worker node in the nova-az2 availability zone, it fails with the following error:
      ====
      0/11 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 7 Insufficient memory. preemption: 0/11 nodes are available: 4 Preemption is not helpful for scheduling, 7 No preemption victims found for incoming pod..
      ====
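
      The same message appears as a FailedScheduling event on the pending pod; the pod name and namespace below are illustrative:
      =====
      # The Events section repeats the scheduler error shown above
      $ oc describe pod my-app-0 -n my-namespace
      =====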
      
      5. To resolve this issue, we needed to update the node label to match the node affinity of the volume, changing:
      ======
      labels:
        topology.cinder.csi.openstack.org/zone: nova-az2   <-- Change to nova or nova-az3
      ======
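
      One way to apply that change (the node name is illustrative; --overwrite is required because the kubelet has already set the label):
      =====
      $ oc label node worker-az2-0 topology.cinder.csi.openstack.org/zone=nova --overwrite
      =====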
      
      6. Once this change was made, the pod transitioned to a running state, and the volume was successfully attached. However, the node-registrar container in the "openstack-cinder-csi-driver-node-" pod then entered an error state with the following message:
      ======
      plugin registration failed with err: error updating Node object with CSI driver node info: error updating node: timed out waiting for the condition; caused by: detected topology value collision: driver reported "topology.cinder.csi.openstack.org/zone":"nova-az2" but existing label is "topology.cinder.csi.openstack.org/zone":"nova".
      ======
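
      The registrar error can be read from the CSI driver node pod running on the affected node. The namespace and container name below are assumptions based on the usual layout of the driver DaemonSet and should be verified; the node and pod names are illustrative:
      =====
      $ oc get pods -n openshift-cluster-csi-drivers -o wide | grep worker-az2-0
      $ oc logs -n openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-xxxxx -c csi-node-driver-registrar
      =====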
      
      7. To address this, we had to revert the node label to its previous state.
      
      8. The issue persists: pods stay in the Running state only until they are rescheduled onto another worker node in the nova-az2 zone, or until the nodes are rebooted during a cluster upgrade or crash recovery.
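
      To see which nodes will hit the conflict again, the kubelet-injected zone label can be listed for every node and compared with the zones pinned in the volumes' node affinity (step 3):
      =====
      # Nodes reporting nova-az2 will conflict with volumes pinned to nova or nova-az3
      $ oc get nodes -L topology.cinder.csi.openstack.org/zone
      =====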

      Actual results:

          

      Expected results:

          

      Additional info:

          
