OpenShift Bugs / OCPBUGS-14074

Pods with GPU requests are failing to get scheduled because the autoscaler is not scaling up the matching nodes


Details

    • Important
    • No
    • CLOUD Sprint 239, CLOUD Sprint 240, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244
    • 6
    • Rejected
    • False
    • Customer Escalated

    Description

      Description of problem:

      A GPU-enabled pod is failing to trigger a node scale-up through the cluster autoscaler.
      The following messages appear in the cluster autoscaler pod logs:
      
      2023-05-23T08:49:52.701980777Z I0523 08:49:52.701962       1 non_csi.go:241] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702048033Z I0523 08:49:52.702025       1 csi.go:99] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702079470Z I0523 08:49:52.702073       1 binder.go:257] "FindPodVolumes" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392"
      2023-05-23T08:49:52.702115215Z I0523 08:49:52.702092       1 binder.go:782] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702200047Z I0523 08:49:52.702178       1 binder.go:802] "PersistentVolume and node mismatch for pod" PV="pvc-6292e6b5-d261-4b3a-a327-88d12150edbd" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" err="no matching NodeSelectorTerms"
      2023-05-23T08:49:52.702223300Z I0523 08:49:52.702206       1 scale_up.go:299] Pod jupyter-nb-vwfjtc4-0 can't be scheduled on MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
      2023-05-23T08:49:52.702234853Z I0523 08:49:52.702226       1 scale_up.go:458] No pod can fit to MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b
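
      The "Could not get a CSINode object" messages are informational, but the "PersistentVolume and node mismatch ... no matching NodeSelectorTerms" and "volume node affinity conflict" lines show why the scale-up is skipped: the autoscaler simulated a template node for the ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b MachineSet and rejected it because the already-bound PV pvc-6292e6b5-d261-4b3a-a327-88d12150edbd is pinned by its nodeAffinity to a zone that this MachineSet does not serve. A minimal way to compare the two, assuming an AWS providerSpec and a topology.kubernetes.io/zone term on the PV (illustrative commands, not taken from the must-gather):

      # Zone(s) the bound PV is restricted to (written into nodeAffinity at provisioning time)
      $ oc get pv pvc-6292e6b5-d261-4b3a-a327-88d12150edbd \
          -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}'

      # Availability zone the GPU MachineSet would create nodes in
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.spec.template.spec.providerSpec.value.placement.availabilityZone}{"\n"}'

      If the zones differ, the autoscaler will keep rejecting this MachineSet for the pod regardless of GPU capacity, because the pod's volume can only be attached in the PV's zone.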

      Version-Release number of selected component (if applicable):

      OCP version: 4.12.16

      Actual results:

      Node scale-up is not triggered as expected and the pod remains stuck in the Pending state:
      $ omg get pods
      NAME                               READY  STATUS   RESTARTS  AGE
      fast-api-minimal-844f4c99f4-wst5h  1/1    Running  0         1d
      jupyter-nb-vwfjtc4-0               0/2    Pending  0         29m <<---
      
      Events:
      23m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone, 3 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {nodeforlogging: }
      22m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 2 node(s) had untolerated taint {nodeforlogging: }, 12 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 1 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
      20m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had no available volume zone
      19m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 node(s) had untolerated taint {nodeforlogging: }, 14 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
      18m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 Insufficient memory, 12 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone
      16m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 9 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 5 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }
      18s        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  (combined from similar events): pod didn't trigger scale-up: 12 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {nodeforlogging: }, 2 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had volume node affinity conflict, 1 Insufficient memory
      29m        Normal   SuccessfulCreate      statefulset/jupyter-nb-vwfjtc4            create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
      7m13s      Warning  FailedScheduling      notebook/jupyter-nb-vwfjtc4               Reissued from pod/jupyter-nb-vwfjtc4-0: 0/15 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {nodeforlogging: }, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/15 nodes are available: 15 Preemption is not helpful for scheduling.
      29m        Normal   SuccessfulCreate      notebook/jupyter-nb-vwfjtc4               Reissued from statefulset/jupyter-nb-vwfjtc4: create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
      27m        Normal   NotTriggerScaleUp     notebook/jupyter-nb-vwfjtc4               Reissued from pod/jupyter-nb-vwfjtc4-0: pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 10 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 3 Insufficient memory
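
      Several of the candidate node groups above are also rejected with "Insufficient nvidia.com/gpu". When a GPU MachineSet is scaled down to zero, the autoscaler only knows that its template node would expose GPUs if the MachineSet advertises that capacity; in OpenShift this is normally done through the scale-from-zero annotations and the accelerator label on the MachineSet's node template. A hedged check, assuming the machine.openshift.io/GPU-count / GPU-type annotations and the cluster-api/accelerator label are the hints in play here (to be confirmed against the must-gather):

      # Scale-from-zero GPU capacity hints on the MachineSet
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.metadata.annotations.machine\.openshift\.io/GPU-count}{" "}{.metadata.annotations.machine\.openshift\.io/GPU-type}{"\n"}'

      # Accelerator label that the ClusterAutoscaler resourceLimits.gpus type is matched against
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.spec.template.spec.metadata.labels.cluster-api/accelerator}{"\n"}'

      In this case, though, the autoscaler log above shows the GPU MachineSet itself is rejected for the volume node affinity conflict, not for missing GPU capacity.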
      

      Expected results:

      Scale-up of an OpenShift Container Platform 4 node with an available GPU should be triggered, and the pod should eventually be scheduled on the newly added node.

      Additional info:

      The must-gather and the oc adm inspect output for the namespace where the pod is hitting the issue are attached.
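
      For reference, these are typically collected with commands like the following (destination directories are placeholders; the namespace comes from the autoscaler logs above):

      $ oc adm must-gather --dest-dir=./must-gather
      $ oc adm inspect ns/aayyad-jupyter --dest-dir=./inspect-aayyad-jupyter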

      Slack Conversation: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1685013880588029

      Reference Bugs: https://issues.redhat.com/browse/OCPBUGS-13541

            People

              mimccune@redhat.com Michael McCune
              rhn-support-vpavithr Vishnudutt Pavithran (Inactive)
              Zhaohua Sun Zhaohua Sun