OpenShift Request For Enhancement / RFE-5579

Ensure that the must-gather pod is not deployed on nodes marked as NotReady


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Undefined
    • Component: oc

      This RFE was requested based on the OCPBUGS-29940 discussion.

      Description of problem:

      Despite one node being in a NotReady state, executing oc adm must-gather results in the must-gather pods being scheduled on the NotReady node. The pods carry a toleration with operator: Exists and no key; an empty key with operator Exists matches all keys, values, and effects, so the pods tolerate every taint, including node.kubernetes.io/unreachable.

      $ oc get pod must-gather-nmhwq -n openshift-must-gather-9jtzz -o yaml | grep -i tolerations -A1
        tolerations:
        - operator: Exists
      
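      A minimal sketch of what a more scoped default could look like, for comparison; the toleration set below covers only the standard control-plane taints and is an assumption, not the RFE's final design:

        tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

      A pod with only these tolerations does not tolerate node.kubernetes.io/unreachable, so the scheduler would filter the NotReady node out instead of scoring it.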

      How reproducible:

      • Increase the scheduler log level to TraceAll so scheduling scores are logged
      $ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "TraceAll" }]'
      
      • Stop the kubelet so the node goes NotReady and receives the unreachable NoExecute/NoSchedule taints
      $ ssh core@mno-ctlplane-0.5g-deployment.lab
      $ sudo systemctl stop kubelet
      
      $ oc get nodes
      NAME                               STATUS     ROLES                         AGE     VERSION
      mno-ctlplane-0.5g-deployment.lab   NotReady   control-plane,master,worker   7h27m   v1.27.8+4fab27b
      mno-ctlplane-1.5g-deployment.lab   Ready      control-plane,master,worker   7h27m   v1.27.8+4fab27b
      mno-ctlplane-2.5g-deployment.lab   Ready      control-plane,master,worker   7h27m   v1.27.8+4fab27b
      mno-worker-0.5g-deployment.lab     Ready      worker                        7h1m    v1.27.8+4fab27b
      mno-worker-1.5g-deployment.lab     Ready      worker                        7h1m    v1.27.8+4fab27b
      
      $ oc describe node mno-ctlplane-0.5g-deployment.lab | grep -A2 -i taint
      Taints:             node.kubernetes.io/unreachable:NoExecute
                          node.kubernetes.io/unreachable:NoSchedule
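
      These taints are applied automatically by the node lifecycle controller once the kubelet stops reporting. They can also be read in machine-readable form from the node spec:
      $ oc get node mno-ctlplane-0.5g-deployment.lab -o jsonpath='{.spec.taints}'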
      
      • Run the must-gather command again; as before, the must-gather pod is scheduled onto the master-0 node, now NotReady, and stays Pending because the stopped kubelet cannot start it
      $ oc adm must-gather 
      $ oc get pod -A -o wide|grep must
      openshift-must-gather-qzctk               must-gather-jzlt6          0/2     Pending       0                15s     <none>            mno-ctlplane-0.5g-deployment.lab   <none>           <none>
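
      The pod's events confirm it was assigned to the NotReady node rather than left unschedulable, for example:
      $ oc describe pod must-gather-jzlt6 -n openshift-must-gather-qzctk | grep -A5 -i events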
      
      • Scheduler logs showing the eligible nodes and their final scores; the NotReady node scores highest because the pod's blanket toleration lets it pass the taint filter, and the default scoring plugins do not penalize a NotReady node
      $ oc logs openshift-kube-scheduler-mno-ctlplane-1.5g-deployment.lab -f -n openshift-kube-scheduler | grep must
      ...
      I0222 17:44:52.501842       1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-0.5g-deployment.lab" score=637
      I0222 17:44:52.501849       1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-1.5g-deployment.lab" score=614
      I0222 17:44:52.501856       1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-2.5g-deployment.lab" score=613
      ...
      $ oc get pod must-gather-ncv8q -n openshift-must-gather-f5mtl -o yaml | grep -i tolerations -A1
        tolerations:
        - operator: Exists
      

      As per the OCPBUGS-29940 discussion, the must-gather pod's toleration of any taint is intended. From a customer's point of view, however, this behaviour is unexpected: troubleshooting tools should be scheduled on a Ready node by default. Introducing a new flag to change this behaviour based on custom taints/tolerations would help address this.
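
      As an illustration only, the new option might look like the sketch below; the flag name is hypothetical and does not exist in oc today:

      # hypothetical flag sketched for this RFE, not a real oc option
      $ oc adm must-gather --toleration='node-role.kubernetes.io/master:NoSchedule'

      Where available, the existing --node-name option can already serve as a workaround by pinning the gather pod to a specific Ready node.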

              Assignee: gausingh@redhat.com (Gaurav Singh)
              Reporter: rhn-support-jclaretm (Jorge Claret Membrado)
              Votes: 0
              Watchers: 7