Description of problem:
Backups get stuck in the WaitingForPluginOperations phase when the node-agent pod is restarted. A DPA was created with small CPU and memory limits for the node-agent in order to force a node-agent pod restart.
Version-Release number of selected component (if applicable):
Installed OADP operator using velero-datamover branch
OCP 4.13
How reproducible:
Always
Steps to Reproduce:
1. Create a DPA, setting small CPU and memory limits for the node-agent pod. The DPA used is shown below, followed by a sketch of how to apply it.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: ts-dpa
  namespace: openshift-adp
  resourceVersion: "137751"
  uid: 337295a4-8978-4afc-81c5-6470238f9a16
spec:
  backupLocations:
  - velero:
      credential:
        key: cloud
        name: cloud-credentials-gcp
      default: true
      objectStorage:
        bucket: oadpbucket227213
        prefix: velero-e2e-be4635b3-40d7-11ee-88d9-845cf3eff33a
      provider: gcp
  configuration:
    restic:
      enable: true
      podConfig:
        resourceAllocations:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 10Mi
    velero:
      defaultPlugins:
      - openshift
      - gcp
      - kubevirt
      - csi
  podDnsConfig: {}
  snapshotLocations: []
status:
  conditions:
  - lastTransitionTime: "2023-08-22T10:36:30Z"
    message: Reconcile complete
    reason: Complete
    status: "True"
    type: Reconciled
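A minimal sketch of applying the DPA and watching for the node-agent restart (assumes the spec above is saved as ts-dpa.yaml; with a 50Mi memory limit the node-agent container will likely be OOMKilled under load, but verify the actual reason as in step 4):
$ oc apply -f ts-dpa.yaml
# watch the node-agent pods; the RESTARTS column should eventually go to 1
$ oc get pods -n openshift-adp -w | grep node-agent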
2. Deploy a stateful application and make sure it holds around 1-2 GB of data (a data-seeding sketch follows).
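A sketch for seeding roughly 1.5 GB of data into the application's volume. The pod name my-app-0 and the mount path /data are hypothetical placeholders; the attached-pvc namespace and the volume1 PVC match the objects referenced later in the DataUpload:
$ oc -n attached-pvc exec my-app-0 -- dd if=/dev/urandom of=/data/testfile bs=1M count=1536
$ oc -n attached-pvc exec my-app-0 -- df -h /data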
3. Execute a backup with the native data mover (snapshotMoveData: true). The resulting Backup and a creation sketch are shown below.
$ oc get backup test-backup2 -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.26.6+6bf3f75
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "26"
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup2
  namespace: openshift-adp
  resourceVersion: "139085"
  uid: eab99b56-a863-4a04-b96d-125b29ac88c4
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - attached-pvc
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: true
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 1
  expiration: "2023-09-21T10:55:59Z"
  formatVersion: 1.1.0
  phase: WaitingForPluginOperations
  progress:
    itemsBackedUp: 39
    totalItems: 39
  startTimestamp: "2023-08-22T10:55:59Z"
  version: 1
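One way to create such a backup is to apply a Backup CR directly; a minimal sketch using only the spec fields shown above (the test may have used a different client, e.g. the Velero CLI):
$ cat <<EOF | oc create -f -
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup2
  namespace: openshift-adp
spec:
  includedNamespaces:
  - attached-pvc
  snapshotMoveData: true
  storageLocation: ts-dpa-1
EOF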
4. Verify that one of the node-agent pods now has a restart count of 1 (pod listing below; a restart-reason check follows the listing).
$ oc get pod
NAME READY STATUS RESTARTS AGE
e9ec5d38e74536139e4d5027ce0c05a12f40d7595138a4987f0fa76035cs6jd 0/1 Completed 0 4h42m
node-agent-8mrts 1/1 Running 0 48m
node-agent-fgsdw 1/1 Running 1 (25m ago) 48m
node-agent-mdgqg 1/1 Running 0 48m
openshift-adp-controller-manager-54877df7c4-wtbtw 1/1 Running 0 4h41m
test-backup2-8vb28 1/1 Running 0 27m
ttl-sh-oadp-operator-bundle-3a42929-1h 1/1 Running 0 4h42m
velero-5d6cbbc5d-vhmpk 1/1 Running 0 32m
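To confirm why the node-agent container restarted, its last termination state can be checked; a sketch (OOMKilled is the expected reason given the 50Mi limit, but that is an assumption until verified):
$ oc -n openshift-adp get pod node-agent-fgsdw \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'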
Actual results:
The backup gets stuck in the WaitingForPluginOperations phase.
The DataUpload is also stuck in the InProgress phase; spec.cancel is set to true, but the cancellation does not appear to take effect (see the DataUpload below and the status check after it).
$ oc get dataupload test-backup2-8vb28 -o yaml
apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  creationTimestamp: "2023-08-22T10:56:04Z"
  generateName: test-backup2-
  generation: 5
  labels:
    velero.io/async-operation-id: du-eab99b56-a863-4a04-b96d-125b29ac88c4.fa094c8b-8308-488f5f389
    velero.io/backup-name: test-backup2
    velero.io/backup-uid: eab99b56-a863-4a04-b96d-125b29ac88c4
    velero.io/pvc-uid: fa094c8b-8308-488f-af41-d77c2278244f
  managedFields:
  - apiVersion: velero.io/v2alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:velero.io/async-operation-id: {}
          f:velero.io/backup-name: {}
          f:velero.io/backup-uid: {}
          f:velero.io/pvc-uid: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"eab99b56-a863-4a04-b96d-125b29ac88c4"}: {}
      f:spec:
        .: {}
        f:backupStorageLocation: {}
        f:csiSnapshot:
          .: {}
          f:snapshotClass: {}
          f:storageClass: {}
          f:volumeSnapshot: {}
        f:operationTimeout: {}
        f:snapshotType: {}
        f:sourceNamespace: {}
        f:sourcePVC: {}
      f:status:
        .: {}
        f:progress: {}
    manager: velero-plugin-for-csi
    operation: Update
    time: "2023-08-22T10:56:04Z"
  - apiVersion: velero.io/v2alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:cancel: {}
      f:status:
        f:message: {}
        f:node: {}
        f:phase: {}
        f:startTimestamp: {}
    manager: node-agent-server
    operation: Update
    time: "2023-08-22T10:59:05Z"
  name: test-backup2-8vb28
  namespace: openshift-adp
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: test-backup2
    uid: eab99b56-a863-4a04-b96d-125b29ac88c4
  resourceVersion: "140453"
  uid: 11b8f306-ba4b-4389-8213-cc18830a2159
spec:
  backupStorageLocation: ts-dpa-1
  cancel: true
  csiSnapshot:
    snapshotClass: ""
    storageClass: standard-csi
    volumeSnapshot: velero-volume1-2w8w6
  operationTimeout: 10m0s
  snapshotType: CSI
  sourceNamespace: attached-pvc
  sourcePVC: volume1
status:
  message: found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel
  node: oadp-51120-9pcz7-worker-b-ltj2x.c.openshift-qe.internal
  phase: InProgress
  progress: {}
  startTimestamp: "2023-08-22T10:56:04Z"
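A quick sketch for confirming that the cancellation request is not acted on, i.e. spec.cancel stays true while status.phase never leaves InProgress:
$ oc -n openshift-adp get dataupload test-backup2-8vb28 \
    -o jsonpath='{.spec.cancel} {.status.phase}{"\n"}'
# or watch for phase transitions
$ oc -n openshift-adp get dataupload test-backup2-8vb28 -w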
Expected results:
The backup should be marked as PartiallyFailed when the node-agent pod is restarted.
Additional info:
Node agent pod logs
$ oc logs node-agent-fgsdw -f
time="2023-08-22T10:59:04Z" level=info msg="Setting log-level to INFO"
time="2023-08-22T10:59:04Z" level=info msg="Starting Velero node-agent server konveyor-dev (-)" logSource="pkg/cmd/cli/nodeagent/server.go:103"
1.6927019443434076e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
time="2023-08-22T10:59:04Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
time="2023-08-22T10:59:04Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/nodeagent/server.go:245"
time="2023-08-22T10:59:05Z" level=warning dataupload=test-backup2-8vb28 logSource="pkg/cmd/cli/nodeagent/server.go:378"
time="2023-08-22T10:59:05Z" level=info msg="Controllers starting..." logSource="pkg/cmd/cli/nodeagent/server.go:287"
1.6927019456482856e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6927019456483958e+09 INFO Starting EventSource {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "source": "kind source: *v1.PodVolumeBackup"}
1.692701945648438e+09 INFO Starting Controller {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup"}
1.692701945648423e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownload"}
1.6927019456484365e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUpload"}
1.6927019456484556e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownloadList"}
1.6927019456484594e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUploadList"}
1.6927019456484637e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v1.Pod"}
1.6927019456484663e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v1.Pod"}
1.6927019456484683e+09 INFO Starting Controller {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload"}
1.692701945648471e+09 INFO Starting Controller {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload"}
1.6927019456484087e+09 INFO Starting EventSource {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.PodVolumeRestore"}
1.6927019456484845e+09 INFO Starting EventSource {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.Pod"}
1.6927019456484897e+09 INFO Starting Controller {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore"}
time="2023-08-22T10:59:05Z" level=info msg="Reconcile test-backup2-8vb28" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:109"
time="2023-08-22T10:59:05Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:226"
time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a-vlh45
time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a-lj6tv
1.6927019458372142e+09 INFO Starting workers {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "worker count": 1}
1.6927019458372939e+09 INFO Starting workers {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "worker count": 1}
1.6927019458374288e+09 INFO Starting workers {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "worker count": 1}
1.692701945837552e+09 INFO Starting workers {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "worker count": 1}
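The log above is from the container started after the restart; the previous (terminated) container's log, which may show where the data mover upload was interrupted, can be retrieved with the --previous flag:
$ oc -n openshift-adp logs node-agent-fgsdw --previous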