OpenShift Bugs / OCPBUGS-60450

4.16: csi-snapshot-controller ClusterOperator in degraded state due to API server timeout


    • Quality / Stability / Reliability
    • False
    • Important
    • Done
    • Bug Fix
      Previously, the CSI snapshot controller listed all CSI volume snapshots on startup with a 10-second timeout. With a large number of volume snapshots in the cluster, this list could time out, and the CSI snapshot controller entered an endless loop of listing the snapshots and timing out.
      In this update, the CSI snapshot controller lists only a single volume snapshot with the 10-second timeout, so it does not enter such a crash loop when the cluster contains a large number of snapshots.
      The CSI snapshot controller then uses proper pagination and timeouts to process the snapshots in the cluster.

      Cloud Platform: AWS

      Component: ODF / Ceph / NooBaa – CSI Snapshot Controller

      Support Case Link: [04165738](https://gss--c.vf.force.com/apex/Support#/cases/04165738)

      Description: The csi-snapshot-controller ClusterOperator is in a degraded state. The pods are reporting repeated timeouts while attempting to list volumesnapshotcontents. Error observed in pod logs:

      Failed to list v1 volumesnapshotcontents with error=Get "https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1/volumesnapshotcontents": context deadline exceeded
      Exiting due to failure to ensure CRDs exist during startup: context deadline exceeded

      Observed Behavior:

      • csi-snapshot-controller fails during initialization due to API server timeouts.
      • No obvious issues found in the API server or other cluster components.
      • Approximately 70,000 VolumeSnapshots exist in the cluster, which might be contributing to the load and latency experienced by the controller.

      Expected Behavior: The CSI Snapshot Controller should handle large numbers of snapshot objects gracefully without degrading or timing out during startup.

      Conclusion / Hypothesis: The issue appears to be due to the high volume of snapshot resources, leading to controller timeouts when interacting with the Kubernetes API server during startup.

      Impact: Snapshot operations may be disrupted, and the overall state of the CSI Snapshot Controller remains degraded, which could impact backup and restore functionalities.

              rhn-engineering-jsafrane Jan Safranek
              rhn-support-deepesha Deepesh Sharma
              Bharat Babbar
              Wei Duan