OADP-2871 (OpenShift API for Data Protection)

Backup is marked as PartiallyFailed when volumeSnapshotContent CR has an error

    Description

      Description of problem:

      I have been seeing this issue recently in OADP 1.3.0. If any VolumeSnapshotContent CR reports an error related to removing the VolumeSnapshotBeingCreated annotation, the backup moves to the WaitingForPluginOperationsPartiallyFailed phase. Because of this, most CSI/NativeDataMover backups have been failing recently.

      When this error occurs, the VolumeSnapshotContent usually takes 10-15 minutes to become ready, as the output below shows.

       $ oc get vsc
      NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT                            VOLUMESNAPSHOTNAMESPACE   AGE
      snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f   true         1073741824    Retain           pd.csi.storage.gke.io   example-snapclass     velero-cassandra-data-cassandra-0-bwxxc   ocp-cassandra             9m19s
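
To spot contents that are still waiting, the JSON form of the listing above (`oc get vsc -o json`) can be filtered on `status.readyToUse`. The following is only a minimal sketch: the function name `not_ready_contents` and the sample input are hypothetical, and the input is assumed to have the usual Kubernetes list shape with an `items` array.

```python
def not_ready_contents(vsc_list):
    """Return (name, error message) pairs for VolumeSnapshotContents
    whose status.readyToUse is not true."""
    pending = []
    for item in vsc_list.get("items", []):
        status = item.get("status") or {}
        if not status.get("readyToUse", False):
            # status.error is optional; use an empty message when it is absent.
            message = (status.get("error") or {}).get("message", "")
            pending.append((item["metadata"]["name"], message))
    return pending


# Illustrative input only (hypothetical names), shaped like `oc get vsc -o json`.
sample = {
    "items": [
        {"metadata": {"name": "snapcontent-a"},
         "status": {"readyToUse": False,
                    "error": {"message": "failed to remove VolumeSnapshotBeingCreated annotation"}}},
        {"metadata": {"name": "snapcontent-b"},
         "status": {"readyToUse": True}},
    ]
}

for name, message in not_ready_contents(sample):
    print(name, message)
```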
      

       

      Version-Release number of selected component (if applicable):
      OADP 1.3.0-117

       

      How reproducible:
      Intermittent. It happens when a VolumeSnapshotContent CR has an error related to removing the annotation.

       

      Steps to Reproduce:
      1. Create a DPA with CSI enabled.

      $ oc get dpa ts-dpa -o yaml
      apiVersion: oadp.openshift.io/v1alpha1
      kind: DataProtectionApplication
      metadata:
        creationTimestamp: "2023-10-12T07:50:55Z"
        generation: 1
        managedFields:
        - apiVersion: oadp.openshift.io/v1alpha1
          fieldsType: FieldsV1
          fieldsV1:
            f:spec:
              .: {}
              f:backupLocations: {}
              f:configuration:
                .: {}
                f:velero:
                  .: {}
                  f:defaultPlugins: {}
          manager: kubectl-create
          operation: Update
          time: "2023-10-12T07:50:55Z"
        - apiVersion: oadp.openshift.io/v1alpha1
          fieldsType: FieldsV1
          fieldsV1:
            f:status:
              .: {}
              f:conditions: {}
          manager: manager
          operation: Update
          subresource: status
          time: "2023-10-12T07:50:55Z"
        name: ts-dpa
        namespace: openshift-adp
        resourceVersion: "63211"
        uid: d1d372b7-a44a-47da-a02d-898e429b96db
      spec:
        backupLocations:
        - velero:
            default: true
            objectStorage:
              bucket: oadpbucket239326
              prefix: velero
            provider: gcp
        configuration:
          velero:
            defaultPlugins:
            - gcp
            - openshift
            - csi
      status:
        conditions:
        - lastTransitionTime: "2023-10-12T07:50:55Z"
          message: Reconcile complete
          reason: Complete
          status: "True"
          type: Reconciled

      2. Create a VolumeSnapshotClass.

      $ oc get vsclass example-snapclass -o yaml
      apiVersion: snapshot.storage.k8s.io/v1
      deletionPolicy: Retain
      driver: pd.csi.storage.gke.io
      kind: VolumeSnapshotClass
      metadata:
        annotations:
          snapshot.storage.kubernetes.io/is-default-class: "true"
        creationTimestamp: "2023-10-12T07:50:44Z"
        generation: 1
        labels:
          velero.io/csi-volumesnapshot-class: "true"
        managedFields:
        - apiVersion: snapshot.storage.k8s.io/v1
          fieldsType: FieldsV1
          fieldsV1:
            f:deletionPolicy: {}
            f:driver: {}
            f:metadata:
              f:annotations:
                .: {}
                f:snapshot.storage.kubernetes.io/is-default-class: {}
              f:labels:
                .: {}
                f:velero.io/csi-volumesnapshot-class: {}
          manager: kubectl-create
          operation: Update
          time: "2023-10-12T07:50:44Z"
        name: example-snapclass
        resourceVersion: "63127"
        uid: fdbde537-dadd-4baa-a4f8-19edf382a3e0

      3. Deploy an application that has more than one PVC.

      $ oc get pod -n ocp-cassandra
      NAME          READY   STATUS    RESTARTS      AGE
      cassandra-0   1/1     Running   0             47m
      cassandra-1   1/1     Running   0             46m
      cassandra-2   1/1     Running   2 (45m ago)   46m

       

      $ oc get pvc -n ocp-cassandra
      NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      cassandra-data-cassandra-0   Bound    pvc-9d285012-5a61-4c7e-bb8d-53424b8b4925   1Gi        RWO            standard-csi   47m
      cassandra-data-cassandra-1   Bound    pvc-4274b6b7-6465-4c2f-8173-6cba6ccb8a32   1Gi        RWO            standard-csi   47m
      cassandra-data-cassandra-2   Bound    pvc-6240bf9e-fc12-45d0-94f3-343d8cb85200   1Gi        RWO            standard-csi   46m
       

      4. Run a CSI backup.

       

       

      $ oc get backup test-backup -o yaml
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/resource-timeout: 10m0s
          velero.io/source-cluster-k8s-gitversion: v1.27.6+98158f9
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "27"
        creationTimestamp: "2023-10-12T08:22:06Z"
        generation: 11
        labels:
          velero.io/storage-location: ts-dpa-1
        name: test-backup
        namespace: openshift-adp
        resourceVersion: "77631"
        uid: a1575cbc-f2e1-48ca-ab6b-2774f6d41aa1
      spec:
        csiSnapshotTimeout: 10m0s
        defaultVolumesToFsBackup: false
        includedNamespaces:
        - ocp-cassandra
        itemOperationTimeout: 4h0m0s
        storageLocation: ts-dpa-1
        ttl: 720h0m0s
      status:
        backupItemOperationsAttempted: 6
        backupItemOperationsCompleted: 5
        backupItemOperationsFailed: 1
        completionTimestamp: "2023-10-12T08:30:14Z"
        csiVolumeSnapshotsAttempted: 3
        errors: 1
        expiration: "2023-11-11T08:22:06Z"
        formatVersion: 1.1.0
        phase: PartiallyFailed
        progress:
          itemsBackedUp: 85
          totalItems: 85
        startTimestamp: "2023-10-12T08:22:06Z"
        version: 1
        warnings: 1
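
As a quick check of the timing in the status above, the elapsed time between `startTimestamp` and `completionTimestamp` can be compared with `csiSnapshotTimeout`. This is a rough sketch using only values copied from the Backup CR; the helper names are hypothetical, and the duration parser handles only the simple `h`/`m`/`s` forms that appear in this report.

```python
import re
from datetime import datetime, timedelta


def parse_k8s_time(value):
    """Parse a timestamp such as 2023-10-12T08:22:06Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def parse_duration(value):
    """Parse durations such as 10m0s or 4h0m0s (hours, minutes, seconds only)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Values copied from the Backup CR in this report.
start = parse_k8s_time("2023-10-12T08:22:06Z")
completion = parse_k8s_time("2023-10-12T08:30:14Z")
timeout = parse_duration("10m0s")

elapsed = completion - start
print(elapsed, timeout, elapsed < timeout)
```

With these values the backup completed 8 minutes 8 seconds after it started, which is less than the 10 minute `csiSnapshotTimeout`.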

       

      Actual results:

      The backup is marked as PartiallyFailed.

       

      $ oc logs velero-85fb774687-mhml9 | grep level=error
      time="2023-10-12T07:51:12Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:194"
      time="2023-10-12T08:22:41Z" level=error msg=0 backup=openshift-adp/test-backup logSource="/remote-source/velero/app/pkg/controller/backup_controller.go:722" 
      
      
      

      VSC.yaml

      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshotContent
      metadata:
        creationTimestamp: "2023-10-12T08:22:11Z"
        finalizers:
        - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
        generation: 1
        labels:
          velero.io/backup-name: test-backup
        managedFields:
        - apiVersion: snapshot.storage.k8s.io/v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:finalizers:
                .: {}
                v:"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection": {}
            f:spec:
              .: {}
              f:deletionPolicy: {}
              f:driver: {}
              f:source:
                .: {}
                f:volumeHandle: {}
              f:volumeSnapshotClassName: {}
              f:volumeSnapshotRef:
                .: {}
                f:apiVersion: {}
                f:kind: {}
                f:name: {}
                f:namespace: {}
                f:resourceVersion: {}
                f:uid: {}
          manager: snapshot-controller
          operation: Update
          time: "2023-10-12T08:22:11Z"
        - apiVersion: snapshot.storage.k8s.io/v1
          fieldsType: FieldsV1
          fieldsV1:
            f:status:
              .: {}
              f:creationTime: {}
              f:error:
                .: {}
                f:message: {}
                f:time: {}
              f:readyToUse: {}
              f:restoreSize: {}
              f:snapshotHandle: {}
          manager: csi-snapshotter
          operation: Update
          subresource: status
          time: "2023-10-12T08:22:21Z"
        - apiVersion: snapshot.storage.k8s.io/v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:labels:
                .: {}
                f:velero.io/backup-name: {}
          manager: velero-plugin-for-csi
          operation: Update
          time: "2023-10-12T08:22:21Z"
        name: snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f
        resourceVersion: "74637"
        uid: 22ddda34-e7a4-4a0c-8a7e-0b2770fbfd60
      spec:
        deletionPolicy: Retain
        driver: pd.csi.storage.gke.io
        source:
          volumeHandle: projects/openshift-qe/zones/us-central1-c/disks/pvc-9d285012-5a61-4c7e-bb8d-53424b8b4925
        volumeSnapshotClassName: example-snapclass
        volumeSnapshotRef:
          apiVersion: snapshot.storage.k8s.io/v1
          kind: VolumeSnapshot
          name: velero-cassandra-data-cassandra-0-bwxxc
          namespace: ocp-cassandra
          resourceVersion: "74531"
          uid: b87c0704-ca99-4b89-a249-320b6f70c54f
      status:
        creationTime: 1697098932063000000
        error:
          message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f: "snapshot controller failed to update snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f\": the object has been modified; please apply your changes to the latest version and try again"'
          time: "2023-10-12T08:22:21Z"
        readyToUse: false
        restoreSize: 1073741824
        snapshotHandle: projects/openshift-qe/global/snapshots/snapshot-b87c0704-ca99-4b89-a249-320b6f70c54f 
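
The `status.error.message` above contains the API server's standard update-conflict text ("the object has been modified; please apply your changes to the latest version and try again"). A small sketch of how a caller might tell this kind of error apart from other snapshot errors follows. The function name and the idea of treating the conflict as retryable are assumptions made for illustration, not the behavior of Velero or the CSI plugin.

```python
# Substrings taken from the error message in this report.
CONFLICT_MARKER = "the object has been modified"
ANNOTATION_MARKER = "failed to remove VolumeSnapshotBeingCreated annotation"


def is_annotation_update_conflict(message):
    """Return True when the message looks like an update conflict hit while
    removing the VolumeSnapshotBeingCreated annotation (hypothetical check)."""
    return CONFLICT_MARKER in message and ANNOTATION_MARKER in message


reported = (
    "Failed to check and update snapshot content: failed to remove "
    "VolumeSnapshotBeingCreated annotation on the content "
    "snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f: the object has been "
    "modified; please apply your changes to the latest version and try again"
)
print(is_annotation_update_conflict(reported))
```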

      Expected results:
      The CSI plugin should wait for at least the specified csiSnapshotTimeout (10m0s in this backup).
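
The expected behavior can be sketched as a polling loop that keeps checking `status.readyToUse` until the timeout expires, instead of failing on the first reported error. This is an illustration under stated assumptions, not the plugin's implementation: `wait_until_ready`, `get_status`, and the injected `clock` and `sleep` parameters are all hypothetical names.

```python
import time


def wait_until_ready(get_status, timeout_seconds, poll_seconds=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until readyToUse is true or the timeout expires.

    get_status is a caller-supplied function returning the
    VolumeSnapshotContent status as a dict. Returns (ready, last_error).
    """
    deadline = clock() + timeout_seconds
    last_error = None
    while True:
        status = get_status() or {}
        if status.get("readyToUse"):
            return True, None
        # Remember the most recent error, but keep waiting until the deadline.
        last_error = (status.get("error") or {}).get("message", last_error)
        if clock() >= deadline:
            return False, last_error
        sleep(poll_seconds)
```

Passing `clock` and `sleep` as parameters keeps the loop easy to exercise without real waiting; a caller would pass the backup's `csiSnapshotTimeout` converted to seconds as `timeout_seconds`.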

       

      Additional info:

      Attached Velero logs:
      https://privatebin.corp.redhat.com/?efa82a9c1cdcb3ec#F3uxhh2BJAziZGXcVJxsT32VYHiWcGVmSfPav7v5ryBz

            People

              spampatt@redhat.com Shubham Pampattiwar
              rhn-support-prajoshi Prasad Joshi