Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.16.z
Component/s: Storage / Operators
Labels:
- splatteam

Severity: Moderate
Regression: None
Blocked: False
Blocked Reason: None
Release Note Text: N/A
Release Note Type: Release Note Not Required
Release Note Status: Done
Target Version: 4.19.0
Target Backport Versions: 4.16, 4.17, 4.18.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Description of problem:

With the RHOCP 4.16 rebase on vSphere CSI Driver v3.1.2, which delivered the ListVolumes optimizations from PR #2249, ListVolumes() is called every minute in RHOCP 4.16 clusters.
- EDIT (brief correction): PR #2249 above appears to be the Workload Control Plane (WCP) implementation; PR #2276 is the equivalent change for the vanilla controller and is the one that concerns this bug.
Bug priority is set to Critical because this issue blocks updating over 55 RHOCP clusters from 4.14 to 4.16.

Version-Release number of selected component (if applicable):

4.16 and newer

How reproducible:

Always

Steps to Reproduce:

Deploy a 4.16.z-stream cluster with the thin-csi storage class and watch the vmware-vsphere-csi-driver-controller logs (-c csi-driver) for recurrent ListVolumes() operations on every vSphere CSI Driver CNS volume.

Actual results:

ListVolumes() is called every minute in RHOCP 4.16 clusters.
The customer has over 3000 CNS volumes provisioned and aims to upgrade a fleet of more than 55 RHOCP clusters to 4.16. In that context, more than 3000 API calls are sent to the vCenter API every minute, overloading it and impacting core operations (e.g. stalling volume provisioning, volume deletion, and volume updates).
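
To put the load in perspective, here is a rough back-of-the-envelope sketch. It assumes one vCenter API call per CNS volume per resync, as described above; the function name `calls_per_hour` is purely illustrative and not part of any tool.

```shell
# Estimate vCenter API calls per hour caused by the periodic resync.
# Assumption: one API call per CNS volume per resync interval.
calls_per_hour() {
  volumes="$1"        # number of CNS volumes
  interval_min="$2"   # resync interval in minutes
  echo $(( volumes * 60 / interval_min ))
}

echo "1m interval:  $(calls_per_hour 3000 1) calls/hour"
echo "10m interval: $(calls_per_hour 3000 10) calls/hour"
```

Under this assumption, lengthening the interval from 1 minute to 10 minutes cuts the call volume by a factor of ten.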

Expected results:

A fix appears to have already been brought upstream as part of kubernetes-sigs#3015, but it does not yet appear to be included in an upstream driver 3.y.z version.
Therefore, the expectation of this bug is for kubernetes-sigs#3015 to be merged into the latest RHOCP 4 branch and backported to a 4.16.z-stream.

Additional info:

Tentative workaround has been shared with the customer:

$ oc --context sharedocp416-sbr patch clustercsidriver csi.vsphere.vmware.com --type merge -p "{\"spec\":{\"managementState\":\"Unmanaged\"}}"
$ oc --context sharedocp416-sbr -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json | jq -r '.spec.template.spec.containers[] | select(.name == "csi-attacher").args'
[
  "--csi-address=$(ADDRESS)",
  "--timeout=300s",
  "--http-endpoint=localhost:8203",
  "--leader-election",
  "--leader-election-lease-duration=137s",
  "--leader-election-renew-deadline=107s",
  "--leader-election-retry-period=26s",
  "--v=2",
  "--reconcile-sync=10m"   <<----------------- ADD THE INCREASED RESYNC INTERVAL
]
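
The commands above show the target state but not how the argument gets added. One possible way, sketched here under assumptions, is a JSON Patch that appends the flag to the csi-attacher container's args once the operator is Unmanaged. The helper name `build_append_arg_patch` is hypothetical, and the container index has to be looked up on the actual cluster rather than assumed.

```shell
# Build a JSON Patch that appends one argument to the args list of the
# container at the given index in a Deployment's pod template.
# (Hypothetical helper; the value must not contain characters that need JSON escaping.)
build_append_arg_patch() {
  idx="$1"   # index of the container in .spec.template.spec.containers
  arg="$2"   # argument to append
  printf '[{"op":"add","path":"/spec/template/spec/containers/%s/args/-","value":"%s"}]\n' "$idx" "$arg"
}

# Usage against a live cluster (not executed here):
# idx=$(oc -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json \
#   | jq '.spec.template.spec.containers | map(.name) | index("csi-attacher")')
# oc -n openshift-cluster-csi-drivers patch deploy/vmware-vsphere-csi-driver-controller \
#   --type json -p "$(build_append_arg_patch "$idx" "--reconcile-sync=10m")"
```

Re-running the `jq` query from the workaround afterwards should confirm whether the argument is present.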

blocks

OCPBUGS-49863 RHOCP 4.16 upgrade blocker - kubernetes-sigs#3015 cherry-pick request for the vsphere-csi-driver

Closed

is cloned by

OCPBUGS-49863 RHOCP 4.16 upgrade blocker - kubernetes-sigs#3015 cherry-pick request for the vsphere-csi-driver

Closed

links to

openshift/vmware-vsphere-csi-driver-operator#284: OCPBUGS-49406: Set reconcile-sync to 10 minute for ListVolume

RHEA-2024:11038 OpenShift Container Platform 4.19.z bug fix update

Assignee: Maxim Patlasov

Reporter: Robert Sandu

QA Contact: Wei Duan

Votes: 0

Watchers: 6

Created: 2025/01/28 10:59 AM

Updated: 2025/02/17 6:14 PM
