Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-49406

RHOCP 4.16 upgrade blocker - kubernetes-sigs#3015 cherry-pick request for the vsphere-csi-driver

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • None
    • 4.16.z
    • Storage / Operators
    • Moderate
    • None
    • False
    • Hide

      None

      Show
      None
    • N/A
    • Release Note Not Required
    • Done

      Description of problem:

      • In the context of the ListVolumes optimizations #2249 delivered with the RHOCP 4.16 rebase on vSphere CSI Driver v3.1.2, ListVolumes() is being called every minute in RHOCP 4.16 clusters.
        • EDIT, for a brief correction: the PR #2249 above seems to be the Workload Control Plane (WCP) implementation, and PR #2276 is the vanilla controller equivalent change that concerns this bug.
      • Bug priority set to Critical as this issue is a blocker for updating over 55 RHOCP clusters from 4.14 to 4.16.

      Version-Release number of selected component (if applicable):

      4.16 and newer

      How reproducible:

      Always

      Steps to Reproduce:

      • Deploy a 4.16.z-stream cluster with thin-csi storage class and watch vmware-vsphere-csi-driver-controller -c csi-driver logs for recurrent ListVolumes() operations on every vSphere CSI Driver CNS volume.

      Actual results:

      • ListVolumes() is being called every minute in RHOCP 4.16 clusters
      • In a context where the customer has over +3000 CNS volumes provisioned and aims to upgrade a +55 RHOCP cluster fleet to 4.16, more than +3000 API calls are being sent every minute to the vCenter API, overloading it and impacting core operations (i.e. stalling volume provisioning, volume deletion, volume updates, etc.)

      Expected results:

      • A fix for this has been seemingly already brought upstream as part of kubernetes-sigs#3015, but seemingly has yet to be implemented in an upstream driver 3.y.z version
      • Therefore, the expectation of this bug if for kubernetes-sigs#3015, merged into the latest RHOCP 4 branch and backported to a 4.16.z-stream

      Additional info:

      • Tentative workaround has been shared with the customer:
        $ oc --context sharedocp416-sbr patch clustercsidriver csi.vsphere.vmware.com --type merge -p "{\"spec\":{\"managementState\":\"Unmanaged\"}}"
        $ oc --context sharedocp416-sbr -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json | jq -r '.spec.template.spec.containers[] | select(.name == "csi-attacher").args'
        [
          "--csi-address=$(ADDRESS)",
          "--timeout=300s",
          "--http-endpoint=localhost:8203",
          "--leader-election",
          "--leader-election-lease-duration=137s",
          "--leader-election-renew-deadline=107s",
          "--leader-election-retry-period=26s",
          "--v=2"
          "--reconcile-sync=10m"   <<----------------- ADD THE INCREASED RSYNC INTERVAL
        ]
        

              rh-ee-mpatlaso Maxim Patlasov
              rhn-support-rsandu Robert Sandu
              Wei Duan Wei Duan
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: