Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 4.16.z
Affects Version/s: 4.16.z
Component/s: Storage / Kubernetes External Components
Labels:
- storage
- storage-csi

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None

Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:

4.16.z
Target Version:

4.16.z
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:
PX Technical Impact:
PX Impact Range:
PX Scheduling Request:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

Previously, the CSI snapshot controller listed all CSI volume snapshots at startup with a 10-second timeout. In a cluster with a large number of volume snapshots, this list request could time out, and the CSI snapshot controller entered an endless loop of listing the snapshots and timing out.
With this update, the CSI snapshot controller lists only a single volume snapshot with a 10-second timeout at startup, so it no longer enters such a crash loop in clusters with a large number of snapshots.
The CSI snapshot controller then uses proper pagination and timeouts to process the snapshots in the cluster.


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Cloud Platform: AWS

Component: ODF / Ceph / NooBaa – CSI Snapshot Controller

Support Case Link: [04165738](https://gss--c.vf.force.com/apex/Support#/cases/04165738)

Description: The csi-snapshot-controller ClusterOperator is in a degraded state. The pods are reporting repeated timeouts while attempting to list volumesnapshotcontents. Error observed in pod logs:

Failed to list v1 volumesnapshotcontents with error=Get "https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1/volumesnapshotcontents": context deadline exceeded
Exiting due to failure to ensure CRDs exist during startup: context deadline exceeded

Observed Behavior:

csi-snapshot-controller fails during initialization due to API server timeouts.
No obvious issues found in the API server or other cluster components.
Approximately 70,000 VolumeSnapshots exist in the cluster, which might be contributing to the load and latency experienced by the controller.

Expected Behavior: The CSI Snapshot Controller should handle large numbers of snapshot objects gracefully without degrading or timing out during startup.

Conclusion / Hypothesis: The issue appears to be due to the high volume of snapshot resources, leading to controller timeouts when interacting with the Kubernetes API server during startup.

Impact: Snapshot operations may be disrupted, and the overall state of the CSI Snapshot Controller remains degraded, which could impact backup and restore functionalities.
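
Illustrative sketch: the difference between the old and the new startup behavior described in the release note can be modeled in a few lines. This is not the controller's code (the controller is written in Go); it is a small Python simulation under stated assumptions. `FakeSnapshotAPI`, `startup_probe`, `list_all_paginated`, the page size of 500, and the per-object cost of 1 ms are all hypothetical and chosen only to make the effect visible. The Kubernetes list API does support `limit` and `continue` parameters, but its continue token is an opaque string, not an integer offset as used here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


class DeadlineExceeded(Exception):
    """Raised when a simulated request would outlast its timeout."""


@dataclass
class FakeSnapshotAPI:
    """In-memory stand-in for the API server (hypothetical, for illustration)."""
    total_objects: int
    seconds_per_object: float  # assumed cost of serving one object

    def list(self, timeout: float, limit: Optional[int] = None,
             continue_from: int = 0) -> Tuple[List[str], Optional[int]]:
        """Return one page of names plus the offset of the next page, if any."""
        if limit is None:
            end = self.total_objects  # unbounded LIST: everything at once
        else:
            end = min(self.total_objects, continue_from + limit)
        cost = (end - continue_from) * self.seconds_per_object
        if cost > timeout:
            raise DeadlineExceeded(f"needed {cost:.1f}s, allowed {timeout:.1f}s")
        items = [f"snapshot-{i}" for i in range(continue_from, end)]
        next_token = end if end < self.total_objects else None
        return items, next_token


def startup_probe(api: FakeSnapshotAPI, timeout: float = 10.0) -> bool:
    """New behavior: confirm the resource is listable by requesting one object."""
    api.list(timeout=timeout, limit=1)
    return True


def list_all_paginated(api: FakeSnapshotAPI, page_size: int = 500,
                       timeout_per_page: float = 10.0) -> int:
    """Walk every page; each request gets its own timeout."""
    count = 0
    token: Optional[int] = 0
    while token is not None:
        items, token = api.list(timeout=timeout_per_page, limit=page_size,
                                continue_from=token)
        count += len(items)
    return count


if __name__ == "__main__":
    # 70,000 objects as in this report; the 1 ms per object is an assumption.
    api = FakeSnapshotAPI(total_objects=70_000, seconds_per_object=0.001)
    try:
        api.list(timeout=10.0)  # old behavior: one unbounded LIST
    except DeadlineExceeded as err:
        print("unbounded list failed:", err)
    print("probe ok:", startup_probe(api))
    print("objects seen via pagination:", list_all_paginated(api))
```

In this model the unbounded list exceeds the 10-second deadline on every attempt, which is what turns a restart into a crash loop, while the single-object probe and each individual page stay well within it.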

links to

openshift/csi-external-snapshotter#186: [release-4.16] OCPBUGS-60450: UPSTREAM: 1238: Snapshot Controller startup should not LIST all volumesnapshots

Assignee: Jan Safranek

Reporter: Deepesh Sharma

Need Info From: None

Contributors: Bharat Babbar

QA Contact: Wei Duan

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2025/08/13 9:03 AM

Updated: 2025/09/04 1:52 PM

Resolved: 2025/09/04 1:52 PM
