Description of problem:
- In the context of the ListVolumes optimizations #2249 delivered with the RHOCP 4.16 rebase on vSphere CSI Driver v3.1.2, ListVolumes() is being called every minute in RHOCP 4.16 clusters.
- EDIT, for a brief correction: the PR #2249 above seems to be the Workload Control Plane (WCP) implementation, and PR #2276 is the vanilla controller equivalent change that concerns this bug.
- Bug priority set to Critical as this issue is a blocker for updating over 55 RHOCP clusters from 4.14 to 4.16.
Version-Release number of selected component (if applicable):
4.16 and newer
How reproducible:
Always
Steps to Reproduce:
- Deploy a 4.16.z-stream cluster with the thin-csi storage class and watch the csi-driver container logs of vmware-vsphere-csi-driver-controller for recurrent ListVolumes() operations on every vSphere CSI Driver CNS volume.
Actual results:
- ListVolumes() is being called every minute in RHOCP 4.16 clusters
- The customer has more than 3000 CNS volumes provisioned and aims to upgrade a fleet of more than 55 RHOCP clusters to 4.16. In this context, more than 3000 API calls are sent to the vCenter API every minute, overloading it and impacting core operations (e.g. stalling volume provisioning, volume deletion, volume updates, etc.)
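For a rough sense of scale, the load described above can be estimated with simple arithmetic. This is only a sketch under a simplifying assumption (one vCenter API call per CNS volume per resync, as the observed numbers suggest); `calls_per_minute` is a hypothetical helper, not part of any driver or CLI, and the 10-minute interval is the `--reconcile-sync=10m` value from the workaround under Additional info.

```shell
#!/bin/sh
# Hypothetical helper: estimate vCenter API calls per minute.
# ASSUMPTION: each resync issues one API call per CNS volume.
#   $1 = number of CNS volumes
#   $2 = resync interval in minutes
calls_per_minute() {
  # Integer division; good enough for an order-of-magnitude estimate.
  echo $(( $1 / $2 ))
}

# Resync every minute, as observed in 4.16
calls_per_minute 3000 1    # prints 3000

# Resync every 10 minutes, as in the tentative workaround
calls_per_minute 3000 10   # prints 300
```

Under this assumption, raising the interval to 10 minutes would cut the average call rate by a factor of ten, although each resync would still produce the same burst of calls.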
Expected results:
- A fix for this appears to have been merged upstream as part of kubernetes-sigs#3015, but does not yet appear to be included in an upstream driver 3.y.z release.
- Therefore, the expectation of this bug is for kubernetes-sigs#3015 to be merged into the latest RHOCP 4 branch and backported to a 4.16.z-stream.
Additional info:
- A tentative workaround has been shared with the customer: set the ClusterCSIDriver to Unmanaged, then add an increased resync interval to the csi-attacher container args:
$ oc --context sharedocp416-sbr patch clustercsidriver csi.vsphere.vmware.com --type merge -p "{\"spec\":{\"managementState\":\"Unmanaged\"}}"
$ oc --context sharedocp416-sbr -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json | jq -r '.spec.template.spec.containers[] | select(.name == "csi-attacher").args'
[
  "--csi-address=$(ADDRESS)",
  "--timeout=300s",
  "--http-endpoint=localhost:8203",
  "--leader-election",
  "--leader-election-lease-duration=137s",
  "--leader-election-renew-deadline=107s",
  "--leader-election-retry-period=26s",
  "--v=2",
  "--reconcile-sync=10m"   <<----------------- ADD THE INCREASED RESYNC INTERVAL
]
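The workaround does not spell out how the extra argument gets added. One possible way, sketched here and not necessarily what was shared with the customer, is a JSON patch that appends `--reconcile-sync=10m` to the csi-attacher args. `build_reconcile_sync_patch` is a hypothetical helper that only builds the patch string. The container index is an assumption and must be looked up on the target cluster first.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON patch (RFC 6902) that appends
# --reconcile-sync=<interval> to the args of the container at <index>.
#   $1 = index of the csi-attacher container in the deployment (ASSUMPTION: looked up beforehand)
#   $2 = resync interval, e.g. 10m
build_reconcile_sync_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/%s/args/-","value":"--reconcile-sync=%s"}]\n' "$1" "$2"
}

# Possible usage against a cluster (left commented out: it needs cluster access and
# the ClusterCSIDriver already set to Unmanaged so the operator does not revert it):
#
#   idx=$(oc -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json \
#         | jq '.spec.template.spec.containers | map(.name) | index("csi-attacher")')
#   oc -n openshift-cluster-csi-drivers patch deploy/vmware-vsphere-csi-driver-controller \
#      --type json -p "$(build_reconcile_sync_patch "$idx" 10m)"
```

The `/args/-` path appends to the end of the array, so the existing arguments are left untouched.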