-
Bug
-
Resolution: Duplicate
-
Undefined
-
None
-
OADP 1.2.0
-
False
-
-
False
-
ToDo
-
-
-
0
-
0
-
Very Likely
-
0
-
None
-
Unset
-
Unknown
-
No
Description of problem:
When attempting to perform a backup, we have encountered an issue where the CSI snapshot process is taking an unreasonably long time to become ready. As a result, the backup operation is timing out and ultimately failing.
Version-Release number of selected component (if applicable):
oadp-operator-bundle-container-1.2.0-79
How reproducible:
It happens frequently enough.
Steps to Reproduce:
1. Deploy statefull application with 10 PVCs
2. Run backup on the application namespace.
3. Watch the VolumeSnapshot and VolumeSnapshotContent for each PVC, some may stuck without ready status.
Actual results:
The CSI snapshot is taking an extended period to become ready, causing backup operations to fail due to a timeout error.
Expected results:
The CSI snapshot should be ready within a reasonable time frame to allow for successful backup operations.
Additional info:
DPA:
oc get dataprotectionapplication.oadp.openshift.io -n openshift-adp example-gcp-velero -o yaml apiVersion: oadp.openshift.io/v1alpha1 kind: DataProtectionApplication metadata: creationTimestamp: "2023-05-31T13:21:27Z" generation: 1 name: example-gcp-velero namespace: openshift-adp resourceVersion: "192609" uid: d546a0d6-8868-4e7f-8bf4-c14f316ffa52 spec: backupLocations: - velero: credential: key: cloud name: cloud-credentials default: true objectStorage: bucket: oadpbucket206810 prefix: velero provider: gcp configuration: restic: enable: false velero: defaultPlugins: - openshift - csi - gcp - vsm features: dataMover: enable: true volumeOptionsForStorageClasses: standard-csi: destinationVolumeOptions: moverSecurityContext: true status: conditions: - lastTransitionTime: "2023-05-31T13:21:27Z" message: Reconcile complete reason: Complete status: "True" type: Reconciled
Backup:
oc get backup -n openshift-adp b4 -o yaml apiVersion: velero.io/v1 kind: Backup metadata: annotations: velero.io/source-cluster-k8s-gitversion: v1.25.8+37a9a08 velero.io/source-cluster-k8s-major-version: "1" velero.io/source-cluster-k8s-minor-version: "25" creationTimestamp: "2023-05-31T13:41:13Z" generation: 6 labels: velero.io/storage-location: example-gcp-velero-1 name: b4 namespace: openshift-adp resourceVersion: "210808" uid: 2ed825be-9763-4173-9140-20b91b44a81b spec: csiSnapshotTimeout: 10m0s defaultVolumesToFsBackup: false hooks: {} includedNamespaces: - a2 itemOperationTimeout: 1h0m0s metadata: {} storageLocation: example-gcp-velero-1 ttl: 720h0m0s status: backupItemOperationsAttempted: 9 backupItemOperationsCompleted: 9 completionTimestamp: "2023-05-31T13:57:52Z" csiVolumeSnapshotsAttempted: 10 csiVolumeSnapshotsCompleted: 10 errors: 2 expiration: "2023-06-30T13:41:13Z" formatVersion: 1.1.0 phase: PartiallyFailed startTimestamp: "2023-05-31T13:41:13Z" version: 1
VolumeSnapshot:
oc get volumesnapshot velero-cassandra-data-cassandra-4-86dkr -o yaml apiVersion: snapshot.storage.k8s.io/v1 kind: VolumeSnapshot metadata: creationTimestamp: "2023-05-31T13:43:18Z" finalizers: - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection generateName: velero-cassandra-data-cassandra-4- generation: 1 labels: velero.io/backup-name: b4 name: velero-cassandra-data-cassandra-4-86dkr namespace: a2 resourceVersion: "203525" uid: f6ac8f42-910a-4980-af54-161c9e280a48 spec: source: persistentVolumeClaimName: cassandra-data-cassandra-4 volumeSnapshotClassName: csi-gcp-vsc status: boundVolumeSnapshotContentName: snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 creationTime: "2023-05-31T13:43:19Z" error: message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48: "snapshot controller failed to update snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48\": the object has been modified; please apply your changes to the latest version and try again"' time: "2023-05-31T13:43:29Z" readyToUse: false restoreSize: 1Gi
VolumeSnapshotContent:
oc get vsc snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 -o yaml apiVersion: snapshot.storage.k8s.io/v1 kind: VolumeSnapshotContent metadata: creationTimestamp: "2023-05-31T13:43:18Z" finalizers: - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection generation: 1 labels: velero.io/backup-name: b4 name: snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 resourceVersion: "203530" uid: 93d329cf-d258-493f-8e38-c61c263bf819 spec: deletionPolicy: Retain driver: pd.csi.storage.gke.io source: volumeHandle: projects/openshift-qe/zones/us-central1-c/disks/pvc-68c4bee0-f90d-4c99-8c03-3410aa1d7fb4 volumeSnapshotClassName: csi-gcp-vsc volumeSnapshotRef: apiVersion: snapshot.storage.k8s.io/v1 kind: VolumeSnapshot name: velero-cassandra-data-cassandra-4-86dkr namespace: a2 resourceVersion: "203388" uid: f6ac8f42-910a-4980-af54-161c9e280a48 status: creationTime: 1685540599542000000 error: message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48: "snapshot controller failed to update snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48\": the object has been modified; please apply your changes to the latest version and try again"' time: "2023-05-31T13:43:29Z" readyToUse: false restoreSize: 1073741824 snapshotHandle: projects/openshift-qe/global/snapshots/snapshot-f6ac8f42-910a-4980-af54-161c9e280a48
Velero logs:
time="2023-05-31T13:53:13Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:18Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:23Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:28Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:28Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48 to have snapshot handle and be ready. Retrying in 5s" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:464" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:28Z" level=error msg="Timed out awaiting reconciliation of volumesnapshotcontent snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48" backup=openshift-adp/b4 cmd=/plugins/velero-plugin-for-vsm logSource="/remote-source/app/internal/util/util.go:472" pluginName=velero-plugin-for-vsm time="2023-05-31T13:53:28Z" level=info msg="1 errors encountered backup up item" backup=openshift-adp/b4 logSource="/remote-source/velero/app/pkg/backup/backup.go:431" name=cassandra-4 time="2023-05-31T13:53:28Z" level=error msg="Error backing up item" backup=openshift-adp/b4 error="error executing custom action (groupResource=volumesnapshotcontents.snapshot.storage.k8s.io, namespace=, name=snapcontent-f6ac8f42-910a-4980-af54-161c9e280a48): rpc error: code = Unknown desc = timed out waiting for the condition" logSource="/remote-source/velero/app/pkg/backup/backup.go:435" name=cassandra-4 time="2023-05-31T13:53:28Z" level=info msg="Backed up 26 items out of an estimated total of 211 (estimate will change throughout the backup)" backup=openshift-adp/b4 logSource="/remote-source/velero/app/pkg/backup/backup.go:397" name=cassandra-4 namespace=a2 progress= resource=pods
- account is impacted by
-
OADP-1719 Excessively long csi backups when volumesnaphot creation taking 40s instead of 1s
- Closed