  1. OpenShift API for Data Protection
  2. OADP-4265

Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"


    • False
    • False
    • oadp-operator-bundle-container-1.3.3-12
    • ToDo
    • 0
    • 0
    • Very Likely
    • 0
    • None
    • Unset
    • Unknown
    • No

      Description of problem:

      Velero was performing a restore while the API server was rolling out to a new version. Velero had trouble connecting to the API server during the rollout, but the restore itself eventually completed. Because the API server was still in the middle of rolling out at that point, Velero failed to update the Restore CR status and gave up. After the connection was restored, Velero did not attempt the update again, so the Restore CR stays stuck in "InProgress" indefinitely. Other components that rely on the Backup/Restore CR status to determine completion can make incorrect decisions as a result.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

       

      Expected results:

       

      Additional info:

      Velero logs:

      time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: could not restore, CustomResourceDefinition \"klusterlets.operator.open-cluster-management.io\" already exists. Warning: the in-cluster version is different than the backed-up version."  logSource="/remotesource/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet
      time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: refresh discovery after restoring CRDs: Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet
      
      ################Restore completed
      time="2023-12-08T04:02:21Z" level=info msg="restore completed" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:513" restore=openshift-adp/acm-klusterlet
      
      time="2023-12-08T04:02:21Z" level=error msg="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/datamover/datamover.go:143"
      time="2023-12-08T04:02:21Z" level=error msg="Error removing VSRs after partially failed restore" error="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:571"
      
      ################FAIL to update restore CR status
      time="2023-12-08T04:02:21Z" level=info msg="Error updating restore's final status" Restore=openshift-adp/acm-klusterlet error="Patch \"https://172.30.0.1:443/apis/velero.io/v1/namespaces/openshift-adp/restores/acm-klusterlet\": dial tcp 172.30.0.1:443: connect: connection refused" error.file="/remote-source/velero/app/pkg/controller/restore_controller.go:216" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).Reconcile" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:216"
      ...
      ################connection is back and it's stable
      time="2023-12-08T04:02:58Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:152"
      time="2023-12-08T04:02:58Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:137"
      

      Restore CR stuck in "InProgress":

      apiVersion: velero.io/v1
      kind: Restore
      metadata:
        annotations:
          lca.openshift.io/apply-wave: "1"
        creationTimestamp: "2023-12-08T04:01:59Z"
        generation: 4
        labels:
          velero.io/storage-location: default
        name: acm-klusterlet
        namespace: openshift-adp
        resourceVersion: "45514"
        uid: 392ad1c8-2a1a-4228-9eaa-d3cb28101de3
      spec:
        backupName: acm-klusterlet
        excludedResources:
        - nodes
        - events
        - events.events.k8s.io
        - backups.velero.io
        - restores.velero.io
        - resticrepositories.velero.io
        - csinodes.storage.k8s.io
        - volumeattachments.storage.k8s.io
        - backuprepositories.velero.io
        hooks: {}
        itemOperationTimeout: 1h0m0s
      status:
        phase: InProgress
        progress:
          itemsRestored: 1
          totalItems: 63
        startTimestamp: "2023-12-08T04:01:59Z"
      

              sseago Scott Seago
              angwang@redhat.com Angie Wang
              Prasad Joshi Prasad Joshi