OpenShift Bugs / OCPBUGS-42358

Pod with volume request assigned to Node before CSINode is available/ready


      Description of problem:

      Running OpenShift Container Platform 4.15 or later on AWS, for example, with the ClusterAutoscaler configured. When creating many pods with generic ephemeral volumes (https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes), or a single pod with a large number of such volumes, the pod can be assigned to a newly created OpenShift Container Platform 4 - Node before the CSINode object is ready/available. The pod then gets stuck in ContainerCreating state because its volumes remain in attaching state, as the overall limit of volumes allowed on that specific OpenShift Container Platform 4 - Node is exceeded.
      
      Because the CSINode object only becomes ready late, the volume limit is not enforced: the kube-scheduler considers the just-created Node feasible for the pod and schedules it there, with the effect that the pod will never be able to start and remains stuck.
      
      This leaves the newly created Node effectively useless, as the workload scheduled on it cannot run and the provisioned volumes remain in attaching state.
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.15 and later
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Create an OpenShift Container Platform 4 - Cluster on AWS and configure the Cluster Autoscaler according to https://docs.openshift.com/container-platform/4.15/machine_management/applying-autoscaling.html
      2. Schedule the pod attached in pod.yaml in a specific namespace to trigger a scale-up, as the pod should not fit on any of the available Node(s) (an illustrative sketch of such a pod is shown after this list)
      3. Wait and observe that the pod is scheduled on the new OpenShift Container Platform 4 - Node shortly after the Node reports Ready state
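
      The attached pod.yaml is not reproduced in this report. As an illustrative sketch only (image, storage class and storage size are assumptions, not taken from the attachment), a pod requesting many generic ephemeral volumes looks roughly like this, with the volume/mount pair repeated until the number of volumes exceeds the per-Node attach limit (27 in this report):

      # Illustrative sketch only; the actual attached pod.yaml is not reproduced here.
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: my-hostname-
      spec:
        containers:
        - name: app
          image: registry.access.redhat.com/ubi9/ubi-minimal   # illustrative image
          command: ["sleep", "infinity"]
          volumeMounts:
          - name: scratch-volume-1
            mountPath: /scratch-1
        volumes:
        - name: scratch-volume-1
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: gp3-csi   # assumed AWS EBS CSI storage class
                resources:
                  requests:
                    storage: 1Gi
        # ...repeat the volumeMounts and volumes entries for scratch-volume-2
        #    through scratch-volume-27 to exceed the Node's attach limit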
      

      Actual results:

      27m         Warning   FailedScheduling         pod/my-hostname-ln9df                                       0/6 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "my-hostname-ln9df-scratch-volume-1". preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
      25m         Warning   FailedScheduling         pod/my-hostname-ln9df                                       0/6 nodes are available: 3 node(s) exceed max volume count, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling..
      27m         Normal    TriggeredScaleUp         pod/my-hostname-ln9df                                       pod triggered scale-up: [{MachineSet/openshift-machine-api/foobar-85xvk-worker-us-west-1b 1->2 (max: 12)}]
      24m         Warning   FailedScheduling         pod/my-hostname-ln9df                                       0/7 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) exceed max volume count, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/7 nodes are available: 3 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
      24m         Normal    NotTriggerScaleUp        pod/my-hostname-ln9df                                       pod didn't trigger scale-up: 1 node(s) didn't find available persistent volumes to bind
      23m         Normal    Scheduled                pod/my-hostname-ln9df                                       Successfully assigned project-200/my-hostname-ln9df to ip-10-0-65-208.us-west-1.compute.internal
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-e529e97d-5186-40cf-8c7e-2fc983acf596"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-6fe6043c-9a8e-4ce8-a1fd-8ae864c584f8"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-6f5729fb-4cf1-4921-9631-0c9451199757"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-c7950502-6c7e-40ff-b4a3-4d5c66c0fd5e"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-2d444065-2ed3-40e1-82b1-33017540bf25"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-e7e07d99-08f1-4800-841c-fe71b7672774"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-32c36f5c-ee0a-418f-bfd0-828cc81159d5"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-1f1bfda9-2e0b-4eaf-9ed5-342fe85f34cf"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       AttachVolume.Attach succeeded for volume "pvc-730da79e-9a15-4f0d-85da-520d65945753"
      23m         Normal    SuccessfulAttachVolume   pod/my-hostname-ln9df                                       (combined from similar events): AttachVolume.Attach succeeded for volume "pvc-d3e20228-396e-46dc-8752-1c87305e0d26"
      2m22s       Warning   FailedAttachVolume       pod/my-hostname-ln9df                                       AttachVolume.Attach failed for volume "pvc-1dafe8d0-942c-46ee-9e75-eef46d532a06" : rpc error: code = Internal desc = Could not attach volume "vol-0999f65dceb94868a" to node "i-09f01efff96b2f7d4": attachment of disk "vol-0999f65dceb94868a" failed, expected device to be attached but was attaching
      
      So the pod does trigger a scale-up in the configured MachineSet, but is then scheduled anyway, even though the CSINode object only allows 26 volumes to be attached while the pod is requesting 27.
      
      $ oc get csinode ip-10-0-65-208.us-west-1.compute.internal -o yaml
      apiVersion: storage.k8s.io/v1
      kind: CSINode
      metadata:
        annotations:
          storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
        creationTimestamp: "2024-09-24T06:14:37Z"
        name: ip-10-0-65-208.us-west-1.compute.internal
        ownerReferences:
        - apiVersion: v1
          kind: Node
          name: ip-10-0-65-208.us-west-1.compute.internal
          uid: ef47f992-0af1-4577-bb45-53098eb2f9af
        resourceVersion: "3689743"
        uid: 08e7bfae-d5c6-4e98-92f5-7657ca31038c
      spec:
        drivers:
        - allocatable:
            count: 26
          name: ebs.csi.aws.com
          nodeID: i-09f01efff96b2f7d4
          topologyKeys:
          - topology.ebs.csi.aws.com/zone
        - name: efs.csi.aws.com
          nodeID: i-09f01efff96b2f7d4
          topologyKeys: null
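
      The per-driver attach limit can also be read directly from the CSINode object, e.g. (jsonpath query shown as a sketch):

      $ oc get csinode ip-10-0-65-208.us-west-1.compute.internal \
          -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
      26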
      
      

      Expected results:

      The pod should never be scheduled and should remain in Pending state, as none of the newly created OpenShift Container Platform 4 - Nodes will be able to satisfy the requirement of attaching 27 volumes. For pods requesting fewer volumes, additional Node scale-up should be triggered so that the CSINode allocatable value is always respected.
      

      Additional info:

      There are various details to consider, but the key point is that the CSINode object must be ready when the Node reports Ready state; otherwise scheduling decisions will be made that are not appropriate or correct.
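
      One way to observe the race (a sketch, not taken from the original report) is to compare when the Node first reports Ready with whether the EBS CSI driver entry is already present in the CSINode object at that point:

      $ oc get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
      $ oc get csinode <node-name> -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'

      If the second command returns nothing while the first already shows a Ready transition, the scheduler is making placement decisions without knowing the Node's volume attach limit.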
      
