OpenShift Virtualization · CNV-80511

[GCP] c3-baremetal instances are limited to 15 volume attachments per node


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: Storage Ecosystem
    • Work Type: Product / Portfolio Work
    • Severity: Critical

      Description of problem:

      c3-baremetal instances are limited to 15 volume attachments per node. This limitation is hard-coded in the gcp-pd CSI driver itself:
      https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L128

      We need to explore a way to override this limit by adding a label to the node:
      node-restriction.kubernetes.io/gke-volume-attach-limit-override=XXX
      https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L926-L937
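      A minimal sketch of applying the override label, assuming cluster-admin access; the node name is taken from the logs below and the value 127 is the maximum the driver honors per this report:

```shell
# Label the node so the gcp-pd CSI driver reports a higher attach limit.
# Node name is from this report's environment; adjust for your cluster.
oc label node test-gcp10-wf7zf-worker-c-6kgqn \
  node-restriction.kubernetes.io/gke-volume-attach-limit-override=127
```

      Note the label uses the node-restriction.kubernetes.io prefix, so the kubelet itself cannot set it; it must be applied by a sufficiently privileged client.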

      Version-Release number of selected component (if applicable):

      4.21.1

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a pod with 75 PVCs, or 75 pods with 1 PVC each.
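      The reproduction step can be sketched as a script that builds the manifests; names, image, and sizes are hypothetical:

```python
# Sketch: build manifests for 75 PVCs and one pod mounting all of them.
# PVC names, pod name, image, and storage size are hypothetical.

def make_pvc(i):
    """A minimal RWO PersistentVolumeClaim manifest."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"test-pvc-{i}"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "1Gi"}},
        },
    }

def make_pod(n):
    """One pod that mounts n PVC-backed volumes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "attach-limit-test"},
        "spec": {
            "containers": [{
                "name": "sleeper",
                "image": "registry.access.redhat.com/ubi9/ubi-minimal",
                "command": ["sleep", "infinity"],
                "volumeMounts": [
                    {"name": f"vol-{i}", "mountPath": f"/mnt/vol-{i}"}
                    for i in range(n)
                ],
            }],
            "volumes": [
                {"name": f"vol-{i}",
                 "persistentVolumeClaim": {"claimName": f"test-pvc-{i}"}}
                for i in range(n)
            ],
        },
    }

pvcs = [make_pvc(i) for i in range(75)]
pod = make_pod(75)
```

      Serializing these dicts to YAML (e.g. via PyYAML) and applying them should hit the 15-attachment ceiling on a c3-baremetal node.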
      

      Actual results:

      The pod is stuck in Pending (or only 15 pods are created, and pod 16 is stuck in Pending).

      Expected results:

      The pod reaches Running state (or all 75 pods reach Running).

      Additional info:

      It is possible to bypass the 15-volumes-per-node limitation by adding the
      node-restriction.kubernetes.io/gke-volume-attach-limit-override=XXX
      label to the node (values up to 127 are honored).

      This revealed two issues:

      1. missing RBAC (solved, see OCPBUGS-77183 )

      2. The gcp-pd CSI driver looks up the node name as:
      test-gcp10-wf7zf-worker-c-6kgqn
      instead of:
      test-gcp10-wf7zf-worker-c-6kgqn.c.ocpstrat-1278.internal

      and logs:

      I0225 21:46:52.147188       1 utils.go:82] /csi.v1.Node/NodeGetInfo called with request: 
      W0225 21:46:52.174054       1 node.go:37] Error getting node test-gcp10-wf7zf-worker-c-6kgqn: nodes "test-gcp10-wf7zf-worker-c-6kgqn" not found, retrying...
      W0225 21:46:53.185924       1 node.go:37] Error getting node test-gcp10-wf7zf-worker-c-6kgqn: nodes "test-gcp10-wf7zf-worker-c-6kgqn" not found, retrying...
      W0225 21:46:55.194148       1 node.go:37] Error getting node test-gcp10-wf7zf-worker-c-6kgqn: nodes "test-gcp10-wf7zf-worker-c-6kgqn" not found, retrying...
      W0225 21:46:59.211100       1 node.go:37] Error getting node test-gcp10-wf7zf-worker-c-6kgqn: nodes "test-gcp10-wf7zf-worker-c-6kgqn" not found, retrying...
      W0225 21:47:07.220454       1 node.go:37] Error getting node test-gcp10-wf7zf-worker-c-6kgqn: nodes "test-gcp10-wf7zf-worker-c-6kgqn" not found, retrying...
      E0225 21:47:07.220471       1 node.go:46] Failed to get node test-gcp10-wf7zf-worker-c-6kgqn after retries: timed out waiting for the condition
      W0225 21:47:07.220479       1 node.go:871] using default value due to err getting node-restriction.kubernetes.io/gke-volume-attach-limit-override: timed out waiting for the condition
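      The failure above can be sketched as a name-lookup mismatch: the driver queries the API server with the short GCE instance name, while the Node object is registered under its FQDN. A hypothetical illustration (not the driver's actual Go code):

```python
# Hypothetical illustration of the lookup mismatch: the Node object is
# registered under the FQDN, but the driver queries with the short name.
registered_nodes = {"test-gcp10-wf7zf-worker-c-6kgqn.c.ocpstrat-1278.internal"}

def get_node(name, nodes):
    """Mimics an API Get(): succeeds only on an exact name match."""
    if name in nodes:
        return name
    raise KeyError(f'nodes "{name}" not found')

short = "test-gcp10-wf7zf-worker-c-6kgqn"
fqdn = short + ".c.ocpstrat-1278.internal"

lookup_failed = False
try:
    get_node(short, registered_nodes)   # what the driver does -> not found
except KeyError:
    lookup_failed = True                # matches the retry/timeout in the log

found = get_node(fqdn, registered_nodes)  # an exact-FQDN query would succeed
```

      When the lookup times out, the driver falls back to the default attach limit, which would explain why the override label is ignored.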
       

      This raises the following question:
      does this node-name mismatch (short name vs. FQDN) prevent the attachment-limit override from taking effect?

              Noam Assouline
              Ahmad Hafi