  1. OpenShift API for Data Protection
  2. OADP-7069

DataUploads are neither resumed nor canceled after node-agent restart and get stuck


    • Quality / Stability / Reliability
    • 3
    • False
    • False
    • ToDo
    • Very Likely
    • 0
    • None
    • Unset
    • Unknown
    • None

      Description of problem:

      I am testing whether DataUploads resume when the node-agent pods restart. The Backup/Restore gets stuck, and the DataUpload is sometimes neither canceled nor resumed after the pods restart. The Backup case is shown here; the same issue occurs with Restore.

      Version-Release number of selected component (if applicable):

      oadp-dev branch, 1.5.3

      How reproducible:

      Always

      Steps to Reproduce:
      1. Deploy any application with multiple PVCs, for example mysql.
      2. Start a Data Mover backup.
      3. As soon as the backup is triggered, delete all node-agent pods so that they are restarted.

      Actual results:

      DataUploads get stuck in the Accepted phase, never move to InProgress, and fail after a long wait.
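
To spot this condition without watching the table by hand, a minimal sketch follows. It assumes the input is the parsed output of `oc get dataupload -o json` and reads only `metadata.name`, `metadata.creationTimestamp` and `status.phase`. The function name `stuck_datauploads`, the 5-minute threshold and the creation timestamps in the sample are illustrative assumptions, not values from this report.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes RFC 3339 timestamp such as 2025-12-09T05:42:01Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_datauploads(dataupload_list, now, max_age=timedelta(minutes=5)):
    """Return names of DataUploads still in Accepted after max_age.

    max_age is an arbitrary threshold chosen for illustration.
    """
    stuck = []
    for item in dataupload_list.get("items", []):
        phase = item.get("status", {}).get("phase", "")
        created = parse_k8s_time(item["metadata"]["creationTimestamp"])
        if phase == "Accepted" and now - created > max_age:
            stuck.append(item["metadata"]["name"])
    return stuck


# Sample shaped like `oc get dataupload -o json`; names match the report,
# creation timestamps are hypothetical.
sample = {
    "items": [
        {
            "metadata": {"name": "test1-r27p7", "creationTimestamp": "2025-12-09T05:42:10Z"},
            "status": {"phase": "Completed"},
        },
        {
            "metadata": {"name": "test1-trhwc", "creationTimestamp": "2025-12-09T05:42:10Z"},
            "status": {"phase": "Accepted"},
        },
    ]
}
now = parse_k8s_time("2025-12-09T05:52:10Z")

print(stuck_datauploads(sample, now))
```

With real data, the same function could be fed `json.load(sys.stdin)` from `oc get dataupload -n openshift-adp -o json`.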

      Expected results:

      DataUploads should be resumed.

      Additional info:

      oc get dataupload -w
      NAME          STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
      test1-r27p7   Completed   2m30s     104857640    104857640     ts-dpa-1           3m40s   oadp-138671-7b84w-worker-c-l9t9v
      test1-trhwc   Accepted                                         ts-dpa-1           3m46s   
      
      oc get backup test1 -o yaml 
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/resource-timeout: 10m0s
          velero.io/source-cluster-k8s-gitversion: v1.32.10
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "32"
        creationTimestamp: "2025-12-09T05:42:01Z"
        generation: 6
        labels:
          velero.io/storage-location: ts-dpa-1
        name: test1
        namespace: openshift-adp
        resourceVersion: "44451"
        uid: daf08633-69ab-4122-9e9d-13fc87e7419a
      spec:
        csiSnapshotTimeout: 10m0s
        defaultVolumesToFsBackup: false
        excludedClusterScopedResources:
        - volumesnapshotcontents.snapshot.storage.k8s.io
        excludedNamespaceScopedResources:
        - volumesnapshots.snapshot.storage.k8s.io
        includedNamespaces:
        - mysql
        itemOperationTimeout: 1h0m0s
        snapshotMoveData: true
        storageLocation: ts-dpa-1
        ttl: 720h0m0s
        volumeGroupSnapshotLabelKey: velero.io/volume-group
      status:
        backupItemOperationsAttempted: 2
        backupItemOperationsCompleted: 1
        expiration: "2026-01-08T05:42:01Z"
        formatVersion: 1.1.0
        hookStatus: {}
        phase: WaitingForPluginOperations
        progress:
          itemsBackedUp: 46
          totalItems: 46
        startTimestamp: "2025-12-09T05:42:01Z"
        version: 1
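
The counters in the status above show why the backup stays in WaitingForPluginOperations: one item operation is still outstanding. A small sketch of that arithmetic follows. The helper name `pending_item_operations` is hypothetical, and treating "attempted minus completed minus failed" as the outstanding count is an assumption about how the counters relate.

```python
def pending_item_operations(status):
    """Item operations attempted but neither completed nor failed yet."""
    attempted = status.get("backupItemOperationsAttempted", 0)
    completed = status.get("backupItemOperationsCompleted", 0)
    failed = status.get("backupItemOperationsFailed", 0)
    return attempted - completed - failed


# Counters copied from the Backup status shown above.
waiting_status = {
    "backupItemOperationsAttempted": 2,
    "backupItemOperationsCompleted": 1,
    "phase": "WaitingForPluginOperations",
}

print(pending_item_operations(waiting_status))
```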
      
      oc get dataupload 
      NAME          STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
      test1-r27p7   Completed   35m       104857640    104857640     ts-dpa-1           36m   oadp-138671-7b84w-worker-c-l9t9v
      test1-trhwc   Failed                                           ts-dpa-1           36m   
      

      Fails after 36m.

      oc get backup test1 -o yaml
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/resource-timeout: 10m0s
          velero.io/source-cluster-k8s-gitversion: v1.32.10
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "32"
        creationTimestamp: "2025-12-09T05:42:01Z"
        generation: 8
        labels:
          velero.io/storage-location: ts-dpa-1
        name: test1
        namespace: openshift-adp
        resourceVersion: "51954"
        uid: daf08633-69ab-4122-9e9d-13fc87e7419a
      spec:
        csiSnapshotTimeout: 10m0s
        defaultVolumesToFsBackup: false
        excludedClusterScopedResources:
        - volumesnapshotcontents.snapshot.storage.k8s.io
        excludedNamespaceScopedResources:
        - volumesnapshots.snapshot.storage.k8s.io
        includedNamespaces:
        - mysql
        itemOperationTimeout: 1h0m0s
        snapshotMoveData: true
        storageLocation: ts-dpa-1
        ttl: 720h0m0s
        volumeGroupSnapshotLabelKey: velero.io/volume-group
      status:
        backupItemOperationsAttempted: 2
        backupItemOperationsCompleted: 1
        backupItemOperationsFailed: 1
        completionTimestamp: "2025-12-09T06:12:25Z"
        errors: 1
        expiration: "2026-01-08T05:42:01Z"
        formatVersion: 1.1.0
        hookStatus: {}
        phase: PartiallyFailed
        progress:
          itemsBackedUp: 46
          totalItems: 46
        startTimestamp: "2025-12-09T05:42:01Z"
        version: 1
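
The `startTimestamp` and `completionTimestamp` in the status above give the time the backup spent before it was marked PartiallyFailed. The sketch below works that out; the helper name `parse_k8s_time` is an illustrative choice, and the timestamps are copied from the YAML.

```python
from datetime import datetime, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes RFC 3339 timestamp such as 2025-12-09T05:42:01Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


start = parse_k8s_time("2025-12-09T05:42:01Z")       # status.startTimestamp
completed = parse_k8s_time("2025-12-09T06:12:25Z")   # status.completionTimestamp
elapsed = completed - start

print(elapsed)  # 0:30:24
```

So the backup completed as PartiallyFailed 30 minutes and 24 seconds after it started.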
      

      DPA:

      oc get dpa -o yaml
      apiVersion: v1
      items:
      - apiVersion: oadp.openshift.io/v1alpha1
        kind: DataProtectionApplication
        metadata:
          creationTimestamp: "2025-12-09T05:34:30Z"
          generation: 3
          name: ts-dpa
          namespace: openshift-adp
          resourceVersion: "176925"
          uid: 7830597f-20a5-4394-8911-7dcd772426f0
        spec:
          backupLocations:
          - velero:
              credential:
                key: cloud
                name: cloud-credentials
              default: true
              objectStorage:
                bucket: oadp1386717b84w
                prefix: velero
              provider: gcp
          configuration:
            nodeAgent:
              enable: true
              uploaderType: kopia
            velero:
              defaultPlugins:
              - csi
              - gcp
              - openshift
              disableFsBackup: false
          logFormat: text
          nonAdmin:
            enable: true
        status:
          conditions:
          - lastTransitionTime: "2025-12-09T14:08:03Z"
            message: Reconcile complete
            reason: Complete
            status: "True"
            type: Reconciled
      kind: List
      metadata:
        resourceVersion: ""
      

              tkaovila@redhat.com Tiger Kaovilai
              rhn-support-ssingla Sachin Singla
              Sachin Singla Sachin Singla