Bug
Resolution: Not a Bug
Normal
None
4.12.z
Important
No
CLOUD Sprint 239, CLOUD Sprint 240, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244
6
Rejected
False
Customer Escalated
Description of problem:
The cluster autoscaler is not scaling up nodes for a GPU-enabled pod. The following logs can be seen in the cluster-autoscaler pods:

2023-05-23T08:49:52.701980777Z I0523 08:49:52.701962 1 non_csi.go:241] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
2023-05-23T08:49:52.702048033Z I0523 08:49:52.702025 1 csi.go:99] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
2023-05-23T08:49:52.702079470Z I0523 08:49:52.702073 1 binder.go:257] "FindPodVolumes" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392"
2023-05-23T08:49:52.702115215Z I0523 08:49:52.702092 1 binder.go:782] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
2023-05-23T08:49:52.702200047Z I0523 08:49:52.702178 1 binder.go:802] "PersistentVolume and node mismatch for pod" PV="pvc-6292e6b5-d261-4b3a-a327-88d12150edbd" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" err="no matching NodeSelectorTerms"
2023-05-23T08:49:52.702223300Z I0523 08:49:52.702206 1 scale_up.go:299] Pod jupyter-nb-vwfjtc4-0 can't be scheduled on MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
2023-05-23T08:49:52.702234853Z I0523 08:49:52.702226 1 scale_up.go:458] No pod can fit to MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b
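The log lines show the autoscaler simulating a "template node" for the GPU MachineSet (scale-from-zero): the simulated node has no CSINode object, and the pod's already-bound PersistentVolume fails the VolumeBinding predicate against it ("no matching NodeSelectorTerms"). A minimal diagnostic sketch, assuming cluster access with oc; the PV name is taken from the log above:

# Zone/topology constraints recorded on the PV the notebook pod is already bound to
$ oc get pv pvc-6292e6b5-d261-4b3a-a327-88d12150edbd \
    -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}'

# CSINode objects exist only for real nodes; the simulated template node has none,
# which is what produces the "Could not get a CSINode object" messages above
$ oc get csinode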
Version-Release number of selected component (if applicable):
OCP version: 4.12.16
Actual results:
Node scale-up is not working as expected, and the pod remains in Pending state:

$ omg get pods
NAME                                READY   STATUS    RESTARTS   AGE
fast-api-minimal-844f4c99f4-wst5h   1/1     Running   0          1d
jupyter-nb-vwfjtc4-0                0/2     Pending   0          29m   <<---

Events:
23m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone, 3 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {nodeforlogging: }
22m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 2 node(s) had untolerated taint {nodeforlogging: }, 12 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 1 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
20m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had no available volume zone
19m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 1 node(s) had untolerated taint {nodeforlogging: }, 14 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
18m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 1 Insufficient memory, 12 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone
16m     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         pod didn't trigger scale-up: 9 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 5 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }
18s     Normal    NotTriggerScaleUp   pod/jupyter-nb-vwfjtc4-0         (combined from similar events): pod didn't trigger scale-up: 12 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {nodeforlogging: }, 2 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had volume node affinity conflict, 1 Insufficient memory
29m     Normal    SuccessfulCreate    statefulset/jupyter-nb-vwfjtc4   create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
7m13s   Warning   FailedScheduling    notebook/jupyter-nb-vwfjtc4      Reissued from pod/jupyter-nb-vwfjtc4-0: 0/15 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {nodeforlogging: }, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/15 nodes are available: 15 Preemption is not helpful for scheduling.
29m     Normal    SuccessfulCreate    notebook/jupyter-nb-vwfjtc4      Reissued from statefulset/jupyter-nb-vwfjtc4: create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
27m     Normal    NotTriggerScaleUp   notebook/jupyter-nb-vwfjtc4      Reissued from pod/jupyter-nb-vwfjtc4-0: pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 10 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 3 Insufficient memory
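The recurring "volume node affinity conflict" and "no available volume zone" reasons suggest comparing the zone the bound PV is pinned to with the availability zone the GPU MachineSet provisions into. A hedged check, assuming an AWS providerSpec layout (object names are taken from the logs above):

# PVCs of the notebook and the zone labels on the bound PV
$ oc get pvc -n aayyad-jupyter
$ oc get pv pvc-6292e6b5-d261-4b3a-a327-88d12150edbd --show-labels

# Availability zone the GPU MachineSet creates instances in
$ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
    -o jsonpath='{.spec.template.spec.providerSpec.value.placement.availabilityZone}{"\n"}'
# If the PV's zone differs from the MachineSet's zone, the autoscaler's simulated node
# can never satisfy the pod's volume node affinity, so no scale-up is triggered.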
Expected results:
Scale-up of an OpenShift Container Platform 4 node with an available GPU should be triggered, and the pod should eventually be scheduled on the newly added node.
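If the volume/zone conflict is resolved (for example, a GPU MachineSet exists in the PV's zone), the expected scale-up can be observed as follows; a hedged sketch using the resource names from this report:

# A new Machine should appear for the GPU MachineSet
$ oc get machines -n openshift-machine-api -w

# A TriggeredScaleUp event should replace the NotTriggerScaleUp events on the pod
$ oc get events -n aayyad-jupyter -w \
    --field-selector involvedObject.name=jupyter-nb-vwfjtc4-0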
Additional info:
Attaching the must-gather and the namespace inspect output from the namespace where the pod is facing the issue.
Slack Conversation: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1685013880588029
Reference Bugs: https://issues.redhat.com/browse/OCPBUGS-13541
is related to: OCPBUGS-6979 - The OpenShift autoscaler does not trigger a scale-up for a MachineAutoscaler with "minReplicas: 0" for Pods that define ephemeral-storage requests. (Closed)
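For context on the related bug, a minimal sketch of the kind of configuration OCPBUGS-6979 describes: a MachineAutoscaler with "minReplicas: 0" targeting a GPU MachineSet. The MachineAutoscaler name and maxReplicas value below are illustrative, not taken from this cluster:

$ cat <<'EOF' | oc apply -f -
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-g5-4xlarge-eu-west-1b              # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 0                               # scale-from-zero, as in OCPBUGS-6979
  maxReplicas: 3                               # illustrative value
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b # GPU MachineSet name from the logs above
EOF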