-
Feature Request
-
Resolution: Unresolved
RFE requested based on the OCPBUGS-29940 discussion
Description of problem:
Even when a node is in the NotReady state, running oc adm must-gather can schedule the must-gather pod on that node. The pod carries a toleration with operator: Exists and no key. An empty key with operator Exists matches all keys, values and effects, so the pod tolerates every taint, including the taints placed on a NotReady node.
$ oc get pod must-gather-nmhwq -n openshift-must-gather-9jtzz -o yaml | grep -i tolerations -A1
tolerations:
- operator: Exists
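The matching rule described above can be sketched as a small shell function. This is only an illustration of the documented Kubernetes toleration semantics, not code taken from the scheduler; the function name `tolerates` and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc or Kubernetes): returns 0 when a
# toleration matches a taint, following the documented matching rules.
# Args: tol_key tol_operator tol_value tol_effect taint_key taint_value taint_effect
tolerates() {
    tol_key=$1
    tol_op=${2:-Equal}   # Equal is the default operator when none is given
    tol_value=$3
    tol_effect=$4
    taint_key=$5
    taint_value=$6
    taint_effect=$7

    # An empty effect on the toleration matches every taint effect.
    if [ -n "$tol_effect" ] && [ "$tol_effect" != "$taint_effect" ]; then
        return 1
    fi
    # An empty key (only meaningful with Exists) matches every taint key.
    if [ -n "$tol_key" ] && [ "$tol_key" != "$taint_key" ]; then
        return 1
    fi
    case $tol_op in
        Exists) return 0 ;;                          # value is ignored
        Equal)  [ "$tol_value" = "$taint_value" ] ;; # values must be equal
        *)      return 1 ;;
    esac
}

# The must-gather toleration (empty key, operator Exists, empty effect)
# checked against the two taints seen on the NotReady node in this report.
for effect in NoExecute NoSchedule; do
    if tolerates "" Exists "" "" node.kubernetes.io/unreachable "" "$effect"; then
        echo "tolerated: node.kubernetes.io/unreachable:$effect"
    fi
done
```

Because both the key and the effect of the toleration are empty, neither early-return check fires, which is why the NotReady node remains a scheduling candidate.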
How reproducible:
- Increase the kube-scheduler log level
$ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "TraceAll" }]'
- Stop kubelet on one node and leave the node in NotReady with the NoExecute/NoSchedule taints
$ ssh core@mno-ctlplane-0.5g-deployment.lab
$ sudo systemctl stop kubelet
$ oc get nodes
NAME                               STATUS     ROLES                         AGE     VERSION
mno-ctlplane-0.5g-deployment.lab   NotReady   control-plane,master,worker   7h27m   v1.27.8+4fab27b
mno-ctlplane-1.5g-deployment.lab   Ready      control-plane,master,worker   7h27m   v1.27.8+4fab27b
mno-ctlplane-2.5g-deployment.lab   Ready      control-plane,master,worker   7h27m   v1.27.8+4fab27b
mno-worker-0.5g-deployment.lab     Ready      worker                        7h1m    v1.27.8+4fab27b
mno-worker-1.5g-deployment.lab     Ready      worker                        7h1m    v1.27.8+4fab27b
$ oc describe node mno-ctlplane-0.5g-deployment.lab | grep -A2 -i taint
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
- Run the must-gather command, which on a previous run created the must-gather pod on the master-0 node; the pod is assigned to the same node again and now stays in Pending
$ oc adm must-gather
$ oc get pod -A -o wide | grep must
openshift-must-gather-qzctk   must-gather-jzlt6   0/2   Pending   0   15s   <none>   mno-ctlplane-0.5g-deployment.lab   <none>   <none>
- Scheduler logs showing the eligible nodes and their scores
$ oc logs openshift-kube-scheduler-mno-ctlplane-1.5g-deployment.lab -f -n openshift-kube-scheduler | grep must
...
I0222 17:44:52.501842 1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-0.5g-deployment.lab" score=637
I0222 17:44:52.501849 1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-1.5g-deployment.lab" score=614
I0222 17:44:52.501856 1 schedule_one.go:748] "Calculated node's final score for pod" pod="openshift-must-gather-qzctk/must-gather-jzlt6" node="mno-ctlplane-2.5g-deployment.lab" score=613
...
$ oc get pod must-gather-ncv8q -n openshift-must-gather-f5mtl -o yaml | grep -i tolerations -A1
tolerations:
- operator: Exists
As per the OCPBUGS-29940 discussion, the must-gather pod's toleration of any taint is intended. However, from a customer point of view this behaviour is not expected: troubleshooting tools should be scheduled on a Ready node by default. Introducing a new flag that changes this behaviour based on custom taints/tolerations would help to address this.
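Until such a flag exists, one possible workaround is to pick a Ready node manually and pin the must-gather pod to it. The sketch below is an illustration only: the helper names `ready_nodes` and `first_ready_node` are hypothetical, and the use of the `--node-name` option is an assumption that should be checked against `oc adm must-gather --help` for the oc version in use.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# names of nodes whose STATUS column is exactly "Ready". Nodes reported as
# NotReady or "Ready,SchedulingDisabled" are skipped.
ready_nodes() {
    awk 'NR > 1 && $2 == "Ready" { print $1 }'
}

# Hypothetical helper: prints only the first Ready node from the same input.
first_ready_node() {
    ready_nodes | head -n 1
}

# Usage against a live cluster (commented out because it needs cluster access;
# --node-name is assumed to be available in the installed oc client):
#   node=$(oc get nodes | first_ready_node)
#   oc adm must-gather --node-name="$node"
```

Matching the STATUS column exactly keeps cordoned nodes out of the candidate list as well, which seems a reasonable default for a troubleshooting run.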
- is related to
-
RFE-6505 [RFE] Enhance must-gather Tool for Full Automation in OpenShift Environments
- Backlog