  1. OpenShift API for Data Protection
  2. OADP-3410

Velero takes much longer than csiSnapshotTimeout to mark a backup as PartiallyFailed


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • OADP 1.3.1
    • csi-plugin

      Description of problem:

      Triggered a CSI backup with the csiSnapshotTimeout field set to 2m. The backup stayed in the WaitingForPluginOperationsPartiallyFailed phase for more than 10 minutes even though the specified CSI snapshot timeout was 2 minutes. The issue occurs only when the VolumeSnapshotContent reports the following error:

       error:
          message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f: "snapshot controller failed to update snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f\": the object has been modified; please apply your changes to the latest version and try again"' 
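
When triaging, it can help to check programmatically whether a given VolumeSnapshotContent carries this error. The following is a minimal sketch, assuming the object has been exported as JSON (for example with `oc get volumesnapshotcontent <name> -o json`) and loaded into a dict. The function name `has_update_conflict` and the abbreviated sample message are illustrative, not part of any tool.

```python
# Text that identifies the update conflict quoted in this report.
CONFLICT_MARKER = "the object has been modified"


def has_update_conflict(vsc: dict) -> bool:
    """Return True if status.error.message contains the conflict text."""
    error = (vsc.get("status") or {}).get("error") or {}
    return CONFLICT_MARKER in (error.get("message") or "")


# Hypothetical sample with the message abbreviated from the report above.
sample = {
    "kind": "VolumeSnapshotContent",
    "metadata": {"name": "snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f"},
    "status": {
        "error": {
            "message": "Failed to check and update snapshot content: "
                       "the object has been modified; please apply your "
                       "changes to the latest version and try again"
        }
    },
}

print(has_update_conflict(sample))
```

The check only matches the message text, so it says nothing about why the conflict happened or whether the snapshot controller later recovered.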

       

      Start and completion timestamps of the backup:

      startTimestamp: "2024-01-30T06:01:44Z"
      completionTimestamp: "2024-01-30T06:16:43Z"
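
As a quick sanity check, the total backup duration can be computed from these two timestamps and compared with the 2m timeout. This is a small sketch; note that it measures the whole backup from start to completion, not only the time spent in WaitingForPluginOperationsPartiallyFailed, which the timestamps alone do not reveal.

```python
from datetime import datetime, timedelta

# Timestamps copied from the Backup status in this report.
START = "2024-01-30T06:01:44Z"
COMPLETION = "2024-01-30T06:16:43Z"
CSI_SNAPSHOT_TIMEOUT = timedelta(minutes=2)  # spec.csiSnapshotTimeout: 2m


def parse_timestamp(ts: str) -> datetime:
    # Kubernetes writes these timestamps in UTC with a trailing "Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


elapsed = parse_timestamp(COMPLETION) - parse_timestamp(START)
overrun = elapsed - CSI_SNAPSHOT_TIMEOUT

print(f"elapsed: {elapsed}")   # elapsed: 0:14:59
print(f"overrun: {overrun}")   # overrun: 0:12:59
```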

       

       

      Version-Release number of selected component (if applicable):
      OADP 1.3.1 

       

      How reproducible:
      Intermittent

       

      Steps to Reproduce:
      1. Deploy a stateful application which has at least 1 PVC. 

      2. Trigger a CSI backup with csiSnapshotTimeout set to 2m.
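
For step 2, a Backup manifest can be generated with a short script. This is a sketch only: the field values are taken from the Backup shown in this report, while the helper name `build_backup` is hypothetical. The script prints JSON, which `oc create -f -` accepts on standard input.

```python
import json


def build_backup(name: str, namespace: str, app_namespace: str,
                 storage_location: str, csi_timeout: str = "2m") -> dict:
    """Build a minimal velero.io/v1 Backup with a CSI snapshot timeout."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "csiSnapshotTimeout": csi_timeout,
            "includedNamespaces": [app_namespace],
            "storageLocation": storage_location,
        },
    }


# Values copied from the Backup in this report.
backup = build_backup("test-backup1", "openshift-adp", "ocp-mysql", "ts-dpa-1")
print(json.dumps(backup, indent=2))
```

Assuming the script is saved as `make_backup.py` (an arbitrary file name), it could be used as `python make_backup.py | oc create -f -`.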

      Actual results:

      The backup took more than 10 minutes to move from the WaitingForPluginOperationsPartiallyFailed phase to PartiallyFailed.

      $ oc get backup test-backup1 -o yaml
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/resource-timeout: 10m0s
          velero.io/source-cluster-k8s-gitversion: v1.26.13+77e61a2
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "26"
        creationTimestamp: "2024-01-30T06:01:44Z"
        generation: 8
        labels:
          velero.io/storage-location: ts-dpa-1
        name: test-backup1
        namespace: openshift-adp
        resourceVersion: "56731"
        uid: a7e3b2c5-e954-47e0-96e5-a006a7251d4b
      spec:
        csiSnapshotTimeout: 2m
        defaultVolumesToFsBackup: false
        includedNamespaces:
        - ocp-mysql
        itemOperationTimeout: 4h0m0s
        snapshotMoveData: false
        storageLocation: ts-dpa-1
        ttl: 720h0m0s
      status:
        backupItemOperationsAttempted: 4
        backupItemOperationsCompleted: 3
        backupItemOperationsFailed: 1
        completionTimestamp: "2024-01-30T06:16:43Z"
        csiVolumeSnapshotsAttempted: 2
        csiVolumeSnapshotsCompleted: 2
        errors: 1
        expiration: "2024-02-29T06:01:44Z"
        formatVersion: 1.1.0
        phase: PartiallyFailed
        progress:
          itemsBackedUp: 66
          totalItems: 66
        startTimestamp: "2024-01-30T06:01:44Z"
        version: 1
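
The Backup above carries three different timeout settings. The sketch below parses them and compares each with the observed start-to-completion time of 14m59s. It only shows which limits the observed duration exceeded and does not identify which setting actually governed the wait. `parse_go_duration` is a hypothetical helper that handles only the h/m/s subset of Go duration strings seen in this output.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_go_duration(text: str) -> timedelta:
    """Parse the h/m/s subset of Go duration strings, e.g. '4h0m0s'."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the pattern did not fully consume (e.g. '10ms').
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


# Values copied from the Backup in this report.
timeouts = {
    "spec.csiSnapshotTimeout": "2m",
    "velero.io/resource-timeout (annotation)": "10m0s",
    "spec.itemOperationTimeout": "4h0m0s",
}
# completionTimestamp minus startTimestamp.
observed = timedelta(minutes=14, seconds=59)

for field, raw in timeouts.items():
    limit = parse_go_duration(raw)
    verdict = "exceeded" if observed > limit else "not reached"
    print(f"{field}: {limit} ({verdict})")
```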

      Expected results:

      The backup should wait no longer than the specified csiSnapshotTimeout (2m here) before moving to PartiallyFailed.

       

      Additional info:

      Velero logs are attached: velero-logs

            spampatt@redhat.com Shubham Pampattiwar
            rhn-support-prajoshi Prasad Joshi