OpenShift API for Data Protection / OADP-2450

[Upstream] NativeDataMover: Backup gets stuck in phase WaitingForPluginOperations when the node-agent pod is restarted



      Description of problem:

      Backups get stuck in the WaitingForPluginOperations phase when the node-agent pod is restarted. A DPA was created with small CPU and memory limits in order to force a node-agent pod restart.

       

      Version-Release number of selected component (if applicable):
      Installed the OADP operator from the velero-datamover branch.

      OCP 4.13

       

      How reproducible:

      Always

       

      Steps to Reproduce:
      1. Create a DPA and set small CPU and memory limits for the node-agent pod:

        apiVersion: oadp.openshift.io/v1alpha1
        kind: DataProtectionApplication
        metadata:
          name: ts-dpa
          namespace: openshift-adp
          resourceVersion: "137751"
          uid: 337295a4-8978-4afc-81c5-6470238f9a16
        spec:
          backupLocations:
          - velero:
              credential:
                key: cloud
                name: cloud-credentials-gcp
              default: true
              objectStorage:
                bucket: oadpbucket227213
                prefix: velero-e2e-be4635b3-40d7-11ee-88d9-845cf3eff33a
              provider: gcp
          configuration:
            restic:
              enable: true
              podConfig:
                resourceAllocations:
                  limits:
                    cpu: 100m
                    memory: 50Mi
                  requests:
                    cpu: 50m
                    memory: 10Mi
            velero:
              defaultPlugins:
              - openshift
              - gcp
              - kubevirt
              - csi
          podDnsConfig: {}
          snapshotLocations: []
        status:
          conditions:
          - lastTransitionTime: "2023-08-22T10:36:30Z"
            message: Reconcile complete
            reason: Complete
            status: "True"
            type: Reconciled

      2. Deploy a stateful application and make sure it holds roughly 1-2 GB of data; a minimal example follows.
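
      For example, a PVC and a pod along the lines of the following can be used. The namespace (attached-pvc), PVC name (volume1), and storage class (standard-csi) match the values visible in the DataUpload further below; the pod name, image, and dd invocation are only illustrative:

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: volume1
          namespace: attached-pvc
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
          storageClassName: standard-csi
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: data-writer            # illustrative name
          namespace: attached-pvc
        spec:
          containers:
          - name: writer
            image: registry.access.redhat.com/ubi9/ubi   # illustrative image
            command:
            - sh
            - -c
            # write ~2 GB of random data onto the PVC, then keep the pod running
            - dd if=/dev/urandom of=/data/file bs=1M count=2048 && sleep infinity
            volumeMounts:
            - mountPath: /data
              name: volume1
          volumes:
          - name: volume1
            persistentVolumeClaim:
              claimName: volume1
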
      3. Execute a backup with the native data mover.
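
      The exact command used is not recorded here; with the velero CLI, a backup of this shape can be created roughly as follows (assuming a Velero build that exposes the built-in data mover via --snapshot-move-data):

      $ velero backup create test-backup2 \
          --namespace openshift-adp \
          --include-namespaces attached-pvc \
          --snapshot-move-data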

      $ oc get backup test-backup2 -o yaml
      apiVersion: velero.io/v1
      kind: Backup
      metadata:
        annotations:
          velero.io/resource-timeout: 10m0s
          velero.io/source-cluster-k8s-gitversion: v1.26.6+6bf3f75
          velero.io/source-cluster-k8s-major-version: "1"
          velero.io/source-cluster-k8s-minor-version: "26"
        labels:
          velero.io/storage-location: ts-dpa-1
        name: test-backup2
        namespace: openshift-adp
        resourceVersion: "139085"
        uid: eab99b56-a863-4a04-b96d-125b29ac88c4
      spec:
        csiSnapshotTimeout: 10m0s
        defaultVolumesToFsBackup: false
        includedNamespaces:
        - attached-pvc
        itemOperationTimeout: 4h0m0s
        snapshotMoveData: true
        storageLocation: ts-dpa-1
        ttl: 720h0m0s
      status:
        backupItemOperationsAttempted: 1
        expiration: "2023-09-21T10:55:59Z"
        formatVersion: 1.1.0
        phase: WaitingForPluginOperations
        progress:
          itemsBackedUp: 39
          totalItems: 39
        startTimestamp: "2023-08-22T10:55:59Z"
        version: 1

      4. Verify that one of the node-agent pods has a restart count of 1.

       

      $ oc get pod
      NAME                                                              READY   STATUS      RESTARTS      AGE
      e9ec5d38e74536139e4d5027ce0c05a12f40d7595138a4987f0fa76035cs6jd   0/1     Completed   0             4h42m
      node-agent-8mrts                                                  1/1     Running     0             48m
      node-agent-fgsdw                                                  1/1     Running     1 (25m ago)   48m
      node-agent-mdgqg                                                  1/1     Running     0             48m
      openshift-adp-controller-manager-54877df7c4-wtbtw                 1/1     Running     0             4h41m
      test-backup2-8vb28                                                1/1     Running     0             27m
      ttl-sh-oadp-operator-bundle-3a42929-1h                            1/1     Running     0             4h42m
      velero-5d6cbbc5d-vhmpk                                            1/1     Running     0             32m
      

      Actual results:

      The backup gets stuck in the WaitingForPluginOperations phase.

      The DataUpload also stays in the InProgress phase; even though spec.cancel has been set to true, the cancellation is not taking effect.
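
      A quick way to see the mismatch between the cancel request and the phase (values reconstructed from the DataUpload YAML below):

      $ oc -n openshift-adp get dataupload test-backup2-8vb28 \
          -o jsonpath='{.spec.cancel}{"  "}{.status.phase}{"\n"}'
      true  InProgress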

      $ oc get dataupload test-backup2-8vb28 -o yaml
      apiVersion: velero.io/v2alpha1
      kind: DataUpload
      metadata:
        creationTimestamp: "2023-08-22T10:56:04Z"
        generateName: test-backup2-
        generation: 5
        labels:
          velero.io/async-operation-id: du-eab99b56-a863-4a04-b96d-125b29ac88c4.fa094c8b-8308-488f5f389
          velero.io/backup-name: test-backup2
          velero.io/backup-uid: eab99b56-a863-4a04-b96d-125b29ac88c4
          velero.io/pvc-uid: fa094c8b-8308-488f-af41-d77c2278244f
        managedFields:
        - apiVersion: velero.io/v2alpha1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:generateName: {}
              f:labels:
                .: {}
                f:velero.io/async-operation-id: {}
                f:velero.io/backup-name: {}
                f:velero.io/backup-uid: {}
                f:velero.io/pvc-uid: {}
              f:ownerReferences:
                .: {}
                k:{"uid":"eab99b56-a863-4a04-b96d-125b29ac88c4"}: {}
            f:spec:
              .: {}
              f:backupStorageLocation: {}
              f:csiSnapshot:
                .: {}
                f:snapshotClass: {}
                f:storageClass: {}
                f:volumeSnapshot: {}
              f:operationTimeout: {}
              f:snapshotType: {}
              f:sourceNamespace: {}
              f:sourcePVC: {}
            f:status:
              .: {}
              f:progress: {}
          manager: velero-plugin-for-csi
          operation: Update
          time: "2023-08-22T10:56:04Z"
        - apiVersion: velero.io/v2alpha1
          fieldsType: FieldsV1
          fieldsV1:
            f:spec:
              f:cancel: {}
            f:status:
              f:message: {}
              f:node: {}
              f:phase: {}
              f:startTimestamp: {}
          manager: node-agent-server
          operation: Update
          time: "2023-08-22T10:59:05Z"
        name: test-backup2-8vb28
        namespace: openshift-adp
        ownerReferences:
        - apiVersion: velero.io/v1
          controller: true
          kind: Backup
          name: test-backup2
          uid: eab99b56-a863-4a04-b96d-125b29ac88c4
        resourceVersion: "140453"
        uid: 11b8f306-ba4b-4389-8213-cc18830a2159
      spec:
        backupStorageLocation: ts-dpa-1
        cancel: true
        csiSnapshot:
          snapshotClass: ""
          storageClass: standard-csi
          volumeSnapshot: velero-volume1-2w8w6
        operationTimeout: 10m0s
        snapshotType: CSI
        sourceNamespace: attached-pvc
        sourcePVC: volume1
      status:
        message: found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel
        node: oadp-51120-9pcz7-worker-b-ltj2x.c.openshift-qe.internal
        phase: InProgress
        progress: {}
        startTimestamp: "2023-08-22T10:56:04Z"

       

      Expected results:

      The backup should be marked as PartiallyFailed when the node-agent pod gets restarted, instead of waiting indefinitely in WaitingForPluginOperations.
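
      For comparison, after a successful cancellation the DataUpload status would be expected to end up roughly like this (illustrative values; Canceled is the terminal phase the velero.io/v2alpha1 DataUpload API defines for a cancelled upload):

      status:
        completionTimestamp: "2023-08-22T11:05:00Z"   # illustrative timestamp
        message: data upload canceled                 # illustrative message
        phase: Canceled
        startTimestamp: "2023-08-22T10:56:04Z"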

       

      Additional info:
      Node-agent pod logs after the restart:

      $ oc logs node-agent-fgsdw -f
      time="2023-08-22T10:59:04Z" level=info msg="Setting log-level to INFO"
      time="2023-08-22T10:59:04Z" level=info msg="Starting Velero node-agent server konveyor-dev (-)" logSource="pkg/cmd/cli/nodeagent/server.go:103"
      1.6927019443434076e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
      time="2023-08-22T10:59:04Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
      time="2023-08-22T10:59:04Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/nodeagent/server.go:245"
      time="2023-08-22T10:59:05Z" level=warning dataupload=test-backup2-8vb28 logSource="pkg/cmd/cli/nodeagent/server.go:378"
      time="2023-08-22T10:59:05Z" level=info msg="Controllers starting..." logSource="pkg/cmd/cli/nodeagent/server.go:287"
      1.6927019456482856e+09    INFO    Starting server    {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
      1.6927019456483958e+09    INFO    Starting EventSource    {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "source": "kind source: *v1.PodVolumeBackup"}
      1.692701945648438e+09    INFO    Starting Controller    {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup"}
      1.692701945648423e+09    INFO    Starting EventSource    {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownload"}
      1.6927019456484365e+09    INFO    Starting EventSource    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUpload"}
      1.6927019456484556e+09    INFO    Starting EventSource    {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownloadList"}
      1.6927019456484594e+09    INFO    Starting EventSource    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUploadList"}
      1.6927019456484637e+09    INFO    Starting EventSource    {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v1.Pod"}
      1.6927019456484663e+09    INFO    Starting EventSource    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v1.Pod"}
      1.6927019456484683e+09    INFO    Starting Controller    {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload"}
      1.692701945648471e+09    INFO    Starting Controller    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload"}
      1.6927019456484087e+09    INFO    Starting EventSource    {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.PodVolumeRestore"}
      1.6927019456484845e+09    INFO    Starting EventSource    {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.Pod"}
      1.6927019456484897e+09    INFO    Starting Controller    {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore"}
      time="2023-08-22T10:59:05Z" level=info msg="Reconcile test-backup2-8vb28" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:109"
      time="2023-08-22T10:59:05Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:226"
      time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a-vlh45
      time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a-lj6tv
      1.6927019458372142e+09    INFO    Starting workers    {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "worker count": 1}
      1.6927019458372939e+09    INFO    Starting workers    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "worker count": 1}
      1.6927019458374288e+09    INFO    Starting workers    {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "worker count": 1}
      1.692701945837552e+09    INFO    Starting workers    {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "worker count": 1}
