Type: Bug
Resolution: Done-Errata
Priority: Critical
Affects Version: OADP 1.3.0
Fixed in Build: oadp-operator-bundle-container-1.4.1-28
Description of problem:
I have been seeing this issue recently in OADP 1.3.0: whenever a VolumeSnapshotContent CR reports an error about removing the VolumeSnapshotBeingCreated annotation, the backup moves to the WaitingForPluginOperationsPartiallyFailed phase. Because of this, most of the CSI/native Data Mover backups have been failing recently.
When it hits this error, the VolumeSnapshotContent usually takes 10-15 minutes to become ready, as you can see in the output below and the attached log.
$ oc get vsc
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT                            VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f   true         1073741824    Retain           pd.csi.storage.gke.io   example-snapclass     velero-cassandra-data-cassandra-0-bwxxc   ocp-cassandra             9m19s
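For context, whether the plugin should keep waiting is decided entirely by the VolumeSnapshotContent status: status.error can be transient while readyToUse is still false. A minimal Go sketch of reading those fields with the external-snapshotter v6 clientset (illustrative only, not the velero-plugin-for-csi code; the kubeconfig handling is an assumption):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

func main() {
	// Assumption for the sketch: load ~/.kube/config the usual way.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := snapclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Fetch the VolumeSnapshotContent and inspect the same fields shown
	// in the `oc get vsc` output above. VSCs are cluster-scoped.
	vsc, err := cs.SnapshotV1().VolumeSnapshotContents().Get(
		context.TODO(), "snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if vsc.Status != nil && vsc.Status.Error != nil && vsc.Status.Error.Message != nil {
		// A transient failure like the annotation-removal conflict surfaces
		// here while readyToUse is still false.
		fmt.Println("transient error:", *vsc.Status.Error.Message)
	}
	if vsc.Status != nil && vsc.Status.ReadyToUse != nil {
		fmt.Println("readyToUse:", *vsc.Status.ReadyToUse)
	}
}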
Version-Release number of selected component (if applicable):
OADP 1.3.0-117
How reproducible:
Intermittent (it happens when the VolumeSnapshotContent CR has an error related to removing the VolumeSnapshotBeingCreated annotation)
Steps to Reproduce:
1. Create a DPA with CSI enabled.
$ oc get dpa ts-dpa -o yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  creationTimestamp: "2023-10-12T07:50:55Z"
  generation: 1
  managedFields:
  - apiVersion: oadp.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:backupLocations: {}
        f:configuration:
          .: {}
          f:velero:
            .: {}
            f:defaultPlugins: {}
    manager: kubectl-create
    operation: Update
    time: "2023-10-12T07:50:55Z"
  - apiVersion: oadp.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
    manager: manager
    operation: Update
    subresource: status
    time: "2023-10-12T07:50:55Z"
  name: ts-dpa
  namespace: openshift-adp
  resourceVersion: "63211"
  uid: d1d372b7-a44a-47da-a02d-898e429b96db
spec:
  backupLocations:
  - velero:
      default: true
      objectStorage:
        bucket: oadpbucket239326
        prefix: velero
      provider: gcp
  configuration:
    velero:
      defaultPlugins:
      - gcp
      - openshift
      - csi
status:
  conditions:
  - lastTransitionTime: "2023-10-12T07:50:55Z"
    message: Reconcile complete
    reason: Complete
    status: "True"
    type: Reconciled
2. Check the VolumeSnapshotClass.
$ oc get vsclass example-snapclass -o yaml
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: pd.csi.storage.gke.io
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2023-10-12T07:50:44Z"
  generation: 1
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  managedFields:
  - apiVersion: snapshot.storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:deletionPolicy: {}
      f:driver: {}
      f:metadata:
        f:annotations:
          .: {}
          f:snapshot.storage.kubernetes.io/is-default-class: {}
        f:labels:
          .: {}
          f:velero.io/csi-volumesnapshot-class: {}
    manager: kubectl-create
    operation: Update
    time: "2023-10-12T07:50:44Z"
  name: example-snapclass
  resourceVersion: "63127"
  uid: fdbde537-dadd-4baa-a4f8-19edf382a3e0
3. Deploy an application that has more than one PVC.
$ oc get pod -n ocp-cassandra
NAME          READY   STATUS    RESTARTS      AGE
cassandra-0   1/1     Running   0             47m
cassandra-1   1/1     Running   0             46m
cassandra-2   1/1     Running   2 (45m ago)   46m

$ oc get pvc -n ocp-cassandra
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cassandra-data-cassandra-0   Bound    pvc-9d285012-5a61-4c7e-bb8d-53424b8b4925   1Gi        RWO            standard-csi   47m
cassandra-data-cassandra-1   Bound    pvc-4274b6b7-6465-4c2f-8173-6cba6ccb8a32   1Gi        RWO            standard-csi   47m
cassandra-data-cassandra-2   Bound    pvc-6240bf9e-fc12-45d0-94f3-343d8cb85200   1Gi        RWO            standard-csi   46m
4. Execute CSI backup
$ oc get backup test-backup -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.27.6+98158f9
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "27"
  creationTimestamp: "2023-10-12T08:22:06Z"
  generation: 11
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup
  namespace: openshift-adp
  resourceVersion: "77631"
  uid: a1575cbc-f2e1-48ca-ab6b-2774f6d41aa1
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - ocp-cassandra
  itemOperationTimeout: 4h0m0s
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 6
  backupItemOperationsCompleted: 5
  backupItemOperationsFailed: 1
  completionTimestamp: "2023-10-12T08:30:14Z"
  csiVolumeSnapshotsAttempted: 3
  errors: 1
  expiration: "2023-11-11T08:22:06Z"
  formatVersion: 1.1.0
  phase: PartiallyFailed
  progress:
    itemsBackedUp: 85
    totalItems: 85
  startTimestamp: "2023-10-12T08:22:06Z"
  version: 1
  warnings: 1
Actual results:
Backup is marked as PartiallyFailed.
$ oc logs velero-85fb774687-mhml9 | grep level=error
time="2023-10-12T07:51:12Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:194"
time="2023-10-12T08:22:41Z" level=error msg=0 backup=openshift-adp/test-backup logSource="/remote-source/velero/app/pkg/controller/backup_controller.go:722"
VSC.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  creationTimestamp: "2023-10-12T08:22:11Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  generation: 1
  labels:
    velero.io/backup-name: test-backup
  managedFields:
  - apiVersion: snapshot.storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection": {}
      f:spec:
        .: {}
        f:deletionPolicy: {}
        f:driver: {}
        f:source:
          .: {}
          f:volumeHandle: {}
        f:volumeSnapshotClassName: {}
        f:volumeSnapshotRef:
          .: {}
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
    manager: snapshot-controller
    operation: Update
    time: "2023-10-12T08:22:11Z"
  - apiVersion: snapshot.storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:creationTime: {}
        f:error:
          .: {}
          f:message: {}
          f:time: {}
        f:readyToUse: {}
        f:restoreSize: {}
        f:snapshotHandle: {}
    manager: csi-snapshotter
    operation: Update
    subresource: status
    time: "2023-10-12T08:22:21Z"
  - apiVersion: snapshot.storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:velero.io/backup-name: {}
    manager: velero-plugin-for-csi
    operation: Update
    time: "2023-10-12T08:22:21Z"
  name: snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f
  resourceVersion: "74637"
  uid: 22ddda34-e7a4-4a0c-8a7e-0b2770fbfd60
spec:
  deletionPolicy: Retain
  driver: pd.csi.storage.gke.io
  source:
    volumeHandle: projects/openshift-qe/zones/us-central1-c/disks/pvc-9d285012-5a61-4c7e-bb8d-53424b8b4925
  volumeSnapshotClassName: example-snapclass
  volumeSnapshotRef:
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    name: velero-cassandra-data-cassandra-0-bwxxc
    namespace: ocp-cassandra
    resourceVersion: "74531"
    uid: b87c0704-ca99-4b89-a249-320b6f70c54f
status:
  creationTime: 1697098932063000000
  error:
    message: 'Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f: "snapshot controller failed to update snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f\": the object has been modified; please apply your changes to the latest version and try again"'
    time: "2023-10-12T08:22:21Z"
  readyToUse: false
  restoreSize: 1073741824
  snapshotHandle: projects/openshift-qe/global/snapshots/snapshot-b87c0704-ca99-4b89-a249-320b6f70c54f
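The underlying failure is an ordinary optimistic-concurrency conflict: the managedFields above show velero-plugin-for-csi labeling the content at 08:22:21, the same second the snapshot controller's annotation-removal update failed with "the object has been modified". The conventional client-go pattern for such conflicts is to re-fetch and retry. A hedged sketch of that pattern (removeBeingCreatedAnnotation is a hypothetical helper, not the snapshot controller's actual code):

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// removeBeingCreatedAnnotation is a hypothetical helper showing the usual
// retry-on-conflict pattern for the update that fails in this bug.
func removeBeingCreatedAnnotation(ctx context.Context, cs snapclient.Interface, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch on every attempt so we mutate the latest resourceVersion.
		vsc, err := cs.SnapshotV1().VolumeSnapshotContents().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Key from the upstream AnnVolumeSnapshotBeingCreated constant in
		// external-snapshotter; stated here as an assumption of the sketch.
		delete(vsc.Annotations, "snapshot.storage.kubernetes.io/volumesnapshot-being-created")
		_, err = cs.SnapshotV1().VolumeSnapshotContents().Update(ctx, vsc, metav1.UpdateOptions{})
		return err // a Conflict error triggers another attempt
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := snapclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = removeBeingCreatedAnnotation(context.TODO(),
		cs, "snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f")
}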
Expected results:
The CSI plugin should wait at least for the specified csiSnapshotTimeout (10m0s in the backup spec above) before marking the backup PartiallyFailed.
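In other words, a transient status.error on the VolumeSnapshotContent should not short-circuit the wait; the plugin should keep polling readyToUse until csiSnapshotTimeout expires. A minimal sketch of that expected behavior (waitForVSCReady and the 5-second poll interval are illustrative assumptions, not the plugin's real implementation):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// waitForVSCReady polls until readyToUse is true or csiSnapshotTimeout
// expires; a transient status.error only logs and keeps the loop going.
func waitForVSCReady(cs snapclient.Interface, name string, csiSnapshotTimeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, csiSnapshotTimeout, func() (bool, error) {
		vsc, err := cs.SnapshotV1().VolumeSnapshotContents().Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err // API errors are fatal; status errors are not
		}
		if vsc.Status == nil {
			return false, nil
		}
		if vsc.Status.Error != nil && vsc.Status.Error.Message != nil {
			// Expected behavior: log and keep waiting instead of marking
			// the backup PartiallyFailed right away.
			fmt.Println("transient snapshot error, still waiting:", *vsc.Status.Error.Message)
		}
		return vsc.Status.ReadyToUse != nil && *vsc.Status.ReadyToUse, nil
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := snapclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// 10m0s matches spec.csiSnapshotTimeout on the backup above.
	if err := waitForVSCReady(cs, "snapcontent-b87c0704-ca99-4b89-a249-320b6f70c54f", 10*time.Minute); err != nil {
		panic(err)
	}
}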
Additional info:
Attached Velero logs:
https://privatebin.corp.redhat.com/?efa82a9c1cdcb3ec#F3uxhh2BJAziZGXcVJxsT32VYHiWcGVmSfPav7v5ryBz
Related issues:
- is related to: OADP-3188 Failure in periodic-ci-oadp-qe-oadp-qe-automation-main-oadp1.3-ocp4.15-lp-interop-oadp-interop-aws, 12-04-2023 (Closed)
- links to: RHBA-2024:132893 OpenShift API for Data Protection (OADP) 1.4.1 security and bug fix update