Uploaded image for project: 'OpenShift API for Data Protection'
  1. OpenShift API for Data Protection
  2. OADP-2046

[DataMover] CSI Snapshot taking long time leading to backup failure

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Duplicate
    • Icon: Undefined Undefined
    • None
    • OADP 1.2.0
    • OADP
    • False
    • Hide

      None

      Show
      None
    • False
    • ToDo
    • No
    • 0
    • 0
    • Very Likely
    • 0
    • None
    • Unset
    • Unknown

      Description of problem:

      When attempting to perform a backup, we have encountered an  issue where the CSI snapshot process is taking an unreasonably long time to become ready. As a result, the backup operation is timing out and ultimately failing.

      Version-Release number of selected component (if applicable):

      oadp-operator-bundle-container-1.2.0-79

      How reproducible:
      It happens frequently enough.

      Steps to Reproduce:
      1. Deploy statefull application with 10 PVCs
      2. Run backup on the application namespace.
      3. Watch the VolumeSnapshot and VolumeSnapshotContent for each PVC, some may stuck without ready status.

      Actual results:

      The CSI snapshot is taking an extended period to become ready, causing backup operations to fail due to a timeout error.

      Expected results:

      The CSI snapshot should be ready within a reasonable time frame to allow for successful backup operations.

      Additional info:

      DPA:

      oc get dataprotectionapplication.oadp.openshift.io -n openshift-adp example-gcp-velero -o yaml 
      apiVersion: oadp.openshift.io/v1alpha1
      kind: DataProtectionApplication
      metadata:
        creationTimestamp: "2023-05-31T13:21:27Z"
        generation: 1
        name: example-gcp-velero
        namespace: openshift-adp
        resourceVersion: "192609"
        uid: d546a0d6-8868-4e7f-8bf4-c14f316ffa52
      spec:
        backupLocations:
        - velero:
            credential:
              key: cloud
              name: cloud-credentials
            default: true
            objectStorage:
              bucket: oadpbucket206810
              prefix: velero
            provider: gcp
        configuration:
          restic:
            enable: false
          velero:
            defaultPlugins:
            - openshift
            - csi
            - gcp
            - vsm
        features:
          dataMover:
            enable: true
            volumeOptionsForStorageClasses:
              standard-csi:
                destinationVolumeOptions:
                  moverSecurityContext: true
      status:
        conditions:
        - lastTransitionTime: "2023-05-31T13:21:27Z"
          message: Reconcile complete
          reason: Complete
          status: "True"
          type: Reconciled 

      Backup:

      oc get backup -n openshift-adp b4 -o yaml
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/source-cluster-k8s-gitversion: v1.25.8+37a9a08
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "25"
        creationTimestamp: "2023-05-31T13:41:13Z"
        generation: 6
        labels:
          velero.io/storage-location: example-gcp-velero-1
        name: b4
        namespace: openshift-adp
        resourceVersion: "210808"
        uid: 2ed825be-9763-4173-9140-20b91b44a81b
      spec:
        csiSnapshotTimeout: 10m0s
        defaultVolumesToFsBackup: false
        hooks: {}
        includedNamespaces:
        - a2
        itemOperationTimeout: 1h0m0s
        metadata: {}
        storageLocation: example-gcp-velero-1
        ttl: 720h0m0s
      status:
        backupItemOperationsAttempted: 9
        backupItemOperationsCompleted: 9
        completionTimestamp: "2023-05-31T13:57:52Z"
        csiVolumeSnapshotsAttempted: 10
        csiVolumeSnapshotsCompleted: 10
        errors: 2
        expiration: "2023-06-30T13:41:13Z"
        formatVersion: 1.1.0
        phase: PartiallyFailed
        startTimestamp: "2023-05-31T13:41:13Z"
        version: 1 

      VolumeSnapshot:

      oc get volumesnapshot  velero-cassandra-data-cassandra-4-86dkr -o yaml
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshot
      metadata:
        creationTimestamp: "2023-05-31T13:43:18Z"
        finalizers:
        - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
        generateName: velero-cassandra-data-cassandra-4-
        generation: 1
        labels:
          velero.io/backup-name: b4
        name: velero-cassandra-data-cassandra-4-86dkr
        namespace: a2
        resourceVersion: "203525"
        uid: f6ac8f42-910a-4980-af54-161c9e280a48
      spec:
        source:
          persistentVolumeClaimName: cassandra-data-cassandra-4
        volumeSnapshotClassName: csi-gcp-vsc
      status:
        boundVolumeSnapshotContentName: snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48
        creationTime: "2023-05-31T13:43:19Z"
        error:
          message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated
            annotation on the content snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48:
            "snapshot controller failed to update snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48
            on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io
            \"snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48\": the object has been modified;
            please apply your changes to the latest version and try again"'
          time: "2023-05-31T13:43:29Z"
        readyToUse: false
        restoreSize: 1Gi 

      VolumeSnapshotContent:

      oc get vsc snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 -o yaml
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshotContent
      metadata:
        creationTimestamp: "2023-05-31T13:43:18Z"
        finalizers:
        - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
        generation: 1
        labels:
          velero.io/backup-name: b4
        name: snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48
        resourceVersion: "203530"
        uid: 93d329cf-d258-493f-8e38-c61c263bf819
      spec:
        deletionPolicy: Retain
        driver: pd.csi.storage.gke.io
        source:
          volumeHandle: projects/openshift-qe/zones/us-central1-c/disks/pvc-68c4bee0-f90d-4c99-8c03-3410aa1d7fb4
        volumeSnapshotClassName: csi-gcp-vsc
        volumeSnapshotRef:
          apiVersion: snapshot.storage.k8s.io/v1
          kind: VolumeSnapshot
          name: velero-cassandra-data-cassandra-4-86dkr
          namespace: a2
          resourceVersion: "203388"
          uid: f6ac8f42-910a-4980-af54-161c9e280a48
      status:
        creationTime: 1685540599542000000
        error:
          message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated
            annotation on the content snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48:
            "snapshot controller failed to update snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48
            on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io
            \"snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48\": the object has been modified;
            please apply your changes to the latest version and try again"'
          time: "2023-05-31T13:43:29Z"
        readyToUse: false
        restoreSize: 1073741824
        snapshotHandle: projects/openshift-qe/global/snapshots/snapshot-f6ac8f42-910a-4980-af54-161c9e280a48 

      Velero logs:

      time="2023-05-31T13:53:13Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:18Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:23Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:28Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:28Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:28Z" level=error msg="Timed out awaiting reconciliation of volumesnapshotcontent snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:472" pluginName=velero-plugin-for-vsm
      time="2023-05-31T13:53:28Z" level=info msg="1 errors encountered backup up item" backup=openshift-adp/b4 logSource="/remote-source/velero/app/pkg/backup/backup.go:431" name=cassandra-4
      time="2023-05-31T13:53:28Z" level=error msg="Error backing up item" backup=openshift-adp/b4 error="error executing custom action (groupResource=volumesnapshotcontents.snapshot.storage.k8s.io, namespace=, name=snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48): rpc error: code = Unknown desc = timed out waiting for the condition" logSource="/remote-source/velero/app/pkg/backup/backup.go:435" name=cassandra-4
      time="2023-05-31T13:53:28Z" level=info msg="Backed up 26 items out of an estimated total of 211 (estimate will change throughout the backup)" backup=openshift-adp/b4 logSource="/remote-source/velero/app/pkg/backup/backup.go:397" name=cassandra-4 namespace=a2 progress= resource=pods
       

            wnstb Wes Hayutin
            sbahar Shahaf Bahar
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: