  1. OpenShift API for Data Protection
  2. OADP-1029

DataMover: aws csi restore partiallyFailed after many restores


Details

    • Bug
    • Resolution: Done
    • Minor
    • OADP 1.1.1
    • OADP 1.1.1
    • data-mover

    Description

      Description of problem:

      The restore ended in the PartiallyFailed phase after 14 successful restores.

      Error found in the restore:

      error preparing volumesnapshotclasses.snapshot.storage.k8s.io/test-849-snapclass: rpc error: code = Unknown desc = timed out waiting for the condition 

       

      Error found in the VolumeSnapshot (VS) and VolumeSnapshotContent (VSContent):

      error:
          message: 'Failed to check and update snapshot content: failed to list snapshot
            for content velero-velero-cassandra-data-cassandra-2-pr2xh-5vgqm: "rpc error:
            code = Internal desc = Could not list snapshots: InvalidParameterValue: Value
            ( 0 ) for parameter maxResults is invalid. Expecting a value greater than 5.\n\tstatus
            code: 400, request id: 13913292-0cd1-49f6-86be-d4d2bba20aa6"'
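The AWS request ID embedded in this message can help when looking for the failing call in AWS-side logs. A minimal sketch for extracting it follows; `aws_request_id` is a hypothetical helper name, and it assumes the message text arrives on stdin with `request id: <id>` on a single line.

```shell
# Hypothetical helper: print the AWS request ID found in an error message.
# Reads text on stdin and prints one ID per matching line.
aws_request_id() {
  sed -n 's/.*request id: \([0-9a-fA-F-]\{1,\}\).*/\1/p'
}

# Possible usage against the cluster (the VSContent name is taken from the
# error above; jsonpath avoids YAML line folding splitting the message):
#   oc get volumesnapshotcontent velero-velero-cassandra-data-cassandra-2-pr2xh-5vgqm \
#     -o jsonpath='{.status.error.message}' | aws_request_id
```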
       

       

      Here is the DPA:

      apiVersion: oadp.openshift.io/v1alpha1
      kind: DataProtectionApplication
      metadata:
        creationTimestamp: "2022-11-09T08:47:38Z"
        generation: 1
        name: ts-dpa
        namespace: openshift-adp
        resourceVersion: "95617"
        uid: 4b3895ab-ae60-46dd-b8d3-37e4dc42bd96
      spec:
        backupLocations:
        - velero:
            config:
              region: us-east-2
            credential:
              key: cloud
              name: cloud-credentials
            default: true
            objectStorage:
              bucket: oadpbucket154245
              prefix: velero-e2e-22cc15a0-600b-11ed-a776-5405db5be9ea
            provider: aws
        configuration:
          restic:
            enable: true
            podConfig:
              resourceAllocations: {}
          velero:
            defaultPlugins:
            - openshift
            - aws
            - kubevirt
            - csi
        features:
          dataMover:
            enable: true
        podDnsConfig: {}
        snapshotLocations: []
      status:
        conditions:
        - lastTransitionTime: "2022-11-09T08:47:38Z"
          message: Reconcile complete
          reason: Complete
          status: "True"
          type: Reconciled
       

       

      Here are some errors from Velero:

      oc logs deploy/velero -n openshift-adp | grep error
      Defaulted container "velero" out of: velero, openshift-velero-plugin (init), velero-plugin-for-aws (init), kubevirt-velero-plugin (init), velero-plugin-for-csi (init)
      time="2022-11-09T08:48:11Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:173"
      time="2022-11-09T11:03:34Z" level=error msg="Timed out awaiting reconciliation of volumesnapshotrestore cassandra-ns/vsr-79txk" cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/util/util.go:498" pluginName=velero-plugin-for-csi restore=openshift-adp/test-849-dzjhq
      time="2022-11-09T11:03:34Z" level=error msg="Timed out awaiting reconciliation of volumesnapshotrestore cassandra-ns/vsr-mxwrl" cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/util/util.go:498" pluginName=velero-plugin-for-csi restore=openshift-adp/test-849-dzjhq
      time="2022-11-09T11:03:34Z" level=error msg="failed to wait for VolumeSnapshotRestores to be completed: timed out waiting for the condition" cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/util/util.go:531" pluginName=velero-plugin-for-csi restore=openshift-adp/test-849-dzjhq
      time="2022-11-09T11:03:43Z" level=error msg="Cluster resource restore error: error preparing volumesnapshotclasses.snapshot.storage.k8s.io/test-849-snapclass: rpc error: code = Unknown desc = timed out waiting for the condition" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:500" restore=openshift-adp/test-849-dzjhq
      time="2022-11-09T11:57:38Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-08724d8a-62ad-47e3-9759-618e94d99761 error="downloadrequests.velero.io \"test-849-dzjhq-08724d8a-62ad-47e3-9759-618e94d99761\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74"
      time="2022-11-09T11:59:30Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-83a362bd-4f58-471c-9c47-00783a03eec9 error="downloadrequests.velero.io \"test-849-dzjhq-83a362bd-4f58-471c-9c47-00783a03eec9\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74"
      time="2022-11-09T11:59:45Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-1f971b8e-415d-4de3-a024-a2b98f730fe1 error="downloadrequests.velero.io \"test-849-dzjhq-1f971b8e-415d-4de3-a024-a2b98f730fe1\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74"
      time="2022-11-09T12:02:39Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-fa0fa64a-67fb-46ec-9d51-6842720fd257 error="downloadrequests.velero.io \"test-849-dzjhq-fa0fa64a-67fb-46ec-9d51-6842720fd257\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74"
      time="2022-11-09T12:03:23Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-bf5f52ac-c375-4da9-aebc-483fec8bab0e error="downloadrequests.velero.io \"test-849-dzjhq-bf5f52ac-c375-4da9-aebc-483fec8bab0e\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74"
      time="2022-11-09T12:15:01Z" level=error msg="Error updating download request" controller=download-request downloadRequest=openshift-adp/test-849-dzjhq-ccc4c57c-5e72-423d-8fc4-f51ae430b5fc error="downloadrequests.velero.io \"test-849-dzjhq-ccc4c57c-5e72-423d-8fc4-f51ae430b5fc\" not found" logSource="/remote-source/velero/app/pkg/controller/download_request_controller.go:74" 
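The plain `grep error` above also matches the download-request messages, which are unrelated to the restore failure itself. As a rough sketch, the output can be narrowed to error-level lines tagged with a single restore; `restore_errors` is a hypothetical helper name, and it assumes each log line carries a whitespace-separated `restore=<namespace>/<name>` token, as the lines above do.

```shell
# Hypothetical helper: keep only error-level lines that belong to one restore.
# Reads Velero log text on stdin; $1 is the namespaced restore name.
restore_errors() {
  awk -v want="restore=$1" '
    /level=error/ {
      # Compare whole tokens so that a longer restore name does not match.
      for (i = 1; i <= NF; i++) {
        if ($i == want) { print; break }
      }
    }
  '
}

# Possible usage against the cluster:
#   oc logs deploy/velero -n openshift-adp | restore_errors openshift-adp/test-849-dzjhq
```

With the log shown above, this would leave the four lines tagged `restore=openshift-adp/test-849-dzjhq` and drop the BackupStorageLocation and download-request lines.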

       

      I attached a must-gather for further investigation.

      Version-Release number of selected component (if applicable):

      OADP 1.1.1 Bundle: 1.1.1-39

      Volsync 0.5.1 Red Hat build

      How reproducible:

      Rerun the DataMover backup and restore until the restore ends as PartiallyFailed (14 times in this bug).

       

      Steps to Reproduce:
      1. Create a DataMover backup.
      2. Delete the namespace and trigger a restore; repeat this multiple times.
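The repeat-until-failure part of these steps could be scripted roughly as follows. This is only a sketch: `restore_until_failure` and `run_restore` are hypothetical names, and `run_restore` is a hook the tester must define to perform one delete-namespace-and-restore cycle and print the final restore phase.

```shell
# Hypothetical driver: repeat the restore cycle until it stops completing.
# $1 is the maximum number of attempts. run_restore (user-defined) receives
# the attempt number and must print the final phase, for example
# Completed or PartiallyFailed.
restore_until_failure() {
  max=$1
  i=1
  while [ "$i" -le "$max" ]; do
    phase=$(run_restore "$i")
    echo "attempt $i: $phase"
    if [ "$phase" != "Completed" ]; then
      return 1   # first attempt that did not complete
    fi
    i=$((i + 1))
  done
  return 0       # every attempt completed
}

# One possible run_restore, assuming the velero CLI is available and using a
# placeholder backup name (BACKUP_NAME is not from this report):
#   run_restore() {
#     oc delete namespace cassandra-ns --wait >/dev/null
#     velero restore create "test-restore-$1" --from-backup BACKUP_NAME \
#       -n openshift-adp --wait >/dev/null
#     oc get restore "test-restore-$1" -n openshift-adp -o jsonpath='{.status.phase}'
#   }
```

Stopping at the first non-Completed phase leaves the cluster in the failed state, so the VolumeSnapshotContent objects and Velero logs can be inspected right away.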

      Actual results:

      The restore ends as PartiallyFailed after several successful restores.

      Expected results:

      The restore should succeed and reach the Completed phase.


          People

            emcmulla@redhat.com Emily McMullan
            sbahar Shahaf Bahar
            Shahaf Bahar Shahaf Bahar