Description of problem:
Running OpenShift Container Platform 4.15 or later, for example on AWS, with the ClusterAutoscaler configured. When creating many pods with generic ephemeral volumes (https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes), or a single pod with a large number of such volumes, the pod can be assigned to a newly created OpenShift Container Platform 4 - Node before the CSINode object is ready/available. The pod then gets stuck in ContainerCreating state because its volumes remain in attaching state, as the overall limit of volumes allowed on that specific OpenShift Container Platform 4 - Node is exceeded. Because the CSINode object becomes ready too late, the volume limit is not enforced: the kube-scheduler considers the just-created Node feasible for the pod and schedules it there, with the effect that the pod can never be created and remains stuck. This renders the newly created Node effectively useless, as the workload scheduled on it cannot run and the provisioned volumes remain in attaching state.
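For reference, a pod using a generic ephemeral volume follows the pattern below (a minimal sketch for illustration only; name, image and storage class are examples and this is not the attached reproducer):

apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-volume-example
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: scratch-volume-1
      mountPath: /scratch-1
  volumes:
  - name: scratch-volume-1
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 1Gi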
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.15 and later
How reproducible:
Always
Steps to Reproduce:
1. Create an OpenShift Container Platform 4 - Cluster on AWS and configure the Cluster Autoscaler according to https://docs.openshift.com/container-platform/4.15/machine_management/applying-autoscaling.html
2. Schedule the pod attached in pod.yaml in a specific namespace to trigger scale-up, as the pod should not fit on any of the available Node(s)
3. Wait and see how the pod is scheduled on the OpenShift Container Platform 4 - Node shortly after the Node reports Ready state (see the observation commands sketched below)
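To observe the race between Node readiness and CSI driver registration during the scale-up, something along the following lines can be used (the node name is an example taken from the results below):

# Watch the new Node reach Ready state
oc get nodes -w

# Check the CSINode object of the new Node right after it reports Ready;
# the drivers list may still lack the ebs.csi.aws.com entry and its allocatable count
oc get csinode ip-10-0-65-208.us-west-1.compute.internal -o yaml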
Actual results:
27m Warning FailedScheduling pod/my-hostname-ln9df 0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "my-hostname-ln9df-scratch-volume-1". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
25m Warning FailedScheduling pod/my-hostname-ln9df 0/6 nodes are available: 3 node(s) exceed max volume count, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
27m Normal TriggeredScaleUp pod/my-hostname-ln9df pod triggered scale-up: [{MachineSet/openshift-machine-api/foobar-85xvk-worker-us-west-1b 1->2 (max: 12)}]
24m Warning FailedScheduling pod/my-hostname-ln9df 0/7 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) exceed max volume count, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/7 nodes are available: 3 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
24m Normal NotTriggerScaleUp pod/my-hostname-ln9df pod didn't trigger scale-up: 1 node(s) didn't find available persistent volumes to bind
23m Normal Scheduled pod/my-hostname-ln9df Successfully assigned project-200/my-hostname-ln9df to ip-10-0-65-208.us-west-1.compute.internal
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-e529e97d-5186-40cf-8c7e-2fc983acf596"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-6fe6043c-9a8e-4ce8-a1fd-8ae864c584f8"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-6f5729fb-4cf1-4921-9631-0c9451199757"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-c7950502-6c7e-40ff-b4a3-4d5c66c0fd5e"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-2d444065-2ed3-40e1-82b1-33017540bf25"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-e7e07d99-08f1-4800-841c-fe71b7672774"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-32c36f5c-ee0a-418f-bfd0-828cc81159d5"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-1f1bfda9-2e0b-4eaf-9ed5-342fe85f34cf"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df AttachVolume.Attach succeeded for volume "pvc-730da79e-9a15-4f0d-85da-520d65945753"
23m Normal SuccessfulAttachVolume pod/my-hostname-ln9df (combined from similar events): AttachVolume.Attach succeeded for volume "pvc-d3e20228-396e-46dc-8752-1c87305e0d26"
2m22s Warning FailedAttachVolume pod/my-hostname-ln9df AttachVolume.Attach failed for volume "pvc-1dafe8d0-942c-46ee-9e75-eef46d532a06" : rpc error: code = Internal desc = Could not attach volume "vol-0999f65dceb94868a" to node "i-09f01efff96b2f7d4": attachment of disk "vol-0999f65dceb94868a" failed, expected device to be attached but was attaching

So the pod does trigger scale-up in the configured MachineSet but then suddenly gets scheduled, even though the CSINode object only allows 26 volumes to be attached (while the pod is requesting 27):

$ oc get csinode ip-10-0-65-208.us-west-1.compute.internal -o yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-09-24T06:14:37Z"
  name: ip-10-0-65-208.us-west-1.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-65-208.us-west-1.compute.internal
    uid: ef47f992-0af1-4577-bb45-53098eb2f9af
  resourceVersion: "3689743"
  uid: 08e7bfae-d5c6-4e98-92f5-7657ca31038c
spec:
  drivers:
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-09f01efff96b2f7d4
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
  - name: efs.csi.aws.com
    nodeID: i-09f01efff96b2f7d4
    topologyKeys: null
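The allocatable volume count reported by the CSI driver can also be read directly and compared against the number of volumes requested by the pod; one possible check (node name as above):

oc get csinode ip-10-0-65-208.us-west-1.compute.internal \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'

In this case the command returns 26, while the pod requests 27 volumes, so the Node can never accommodate the pod.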
Expected results:
The pod should never be scheduled and should remain in Pending state, as none of the newly created OpenShift Container Platform 4 - Node(s) will be able to satisfy the requirement of attaching 27 volumes. For pods requesting fewer volumes, additional scale-up of Node(s) should be triggered so that the CSINode allocatable value is always respected.
Additional info:
There are various details to be considered, but the key point is that the CSINode object must be ready when the Node reports Ready state, as otherwise scheduling decisions are made that are not appropriate or correct.