-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.20
-
None
Description of problem:
I suspect there is a race condition in the hypershift-oadp-plugin, in CheckVolumeSnapshot() at pkg/common/utils.go:274-330. The function iterates over the VolumeSnapshots returned by a List call (line 286) and issues a Get for each one (line 292). If a snapshot is deleted between the List and the subsequent Get, the resulting 404 NotFound error is treated as fatal. This aborts processing of the remaining snapshots, causing backups to be marked PartiallyFailed with missing etcd data.
Version-Release number of selected component (if applicable):
velero server: quay.io/konveyor/velero@sha256:a23111a98c9a7e99ce7a36e2a9288614dcfe13cd5ec9ac2bcf52c849772b160a
azurePlugin: quay.io/konveyor/velero-plugin-for-microsoft-azure@sha256:b2db5f09da514e817a74c992dcca5f90b77c2ab0b2797eba947d224271d6070e
hypershiftPlugin: quay.io/redhat-user-workloads/ocp-art-tenant/oadp-hypershift-oadp-plugin-main@sha256:51df9e40bfa8cf943d6723913e5408c6e435b4d4419c90da001310197c4c0cce
How reproducible:
Intermittent - observed in 25/100 scheduled backups. Depends on timing of snapshot cleanup vs plugin processing.
Steps to Reproduce:
1. Schedule recurring Velero backups of a HyperShift cluster with multiple etcd PVCs
2. Wait for backups to run with DataMover enabled
3. Observe PartiallyFailed backups when snapshot cleanup races with the plugin's List/Get sequence
Actual results:
Backup fails with an error from WaitForVolumeSnapshot() (pkg/common/utils.go:361):
giving up, VolumeSnapshot was not finished in the expected timeout. Err: failed to get volumeSnapshot: volumesnapshots.snapshot.storage.k8s.io "velero-data-etcd-X-XXXXX" not found
One or more etcd PVC snapshots are missing from backup storage.
Expected results:
Deleted snapshots (already completed and cleaned up) should be skipped gracefully, allowing subsequent snapshots to be processed. Backup should complete successfully.
Additional info:
ARO-HCP uses PremiumV2 disks for etcd. PremiumV2 disks do not support instant access snapshots, and each disk takes 8-10 minutes to snapshot, so you may not hit the issue if you use PremiumV2 disks for etcd. To reproduce, you can use Premium disks, which do have instant access snapshots, OR I can provide a forked version of the AzureDisk-CSI-Driver that enables instant access snapshots for PremiumV2 disks.
I vibed up a fix locally and saw success with backups since deploying it.
Add apierrors.IsNotFound(err) handling in CheckVolumeSnapshot() at lines 292-294, matching the existing pattern in CheckVolumeSnapshotContent() at lines 194-198:
    if err := c.Get(ctx, types.NamespacedName{...}, object); err != nil {
        if !apierrors.IsNotFound(err) {
            return started, finished, fmt.Errorf("failed to get volumeSnapshot: %w", err)
        }
        continue // Skip deleted snapshots gracefully
    }
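
For anyone who wants to see the race without a cluster, here is a minimal, stdlib-only sketch of the List/Get pattern. It is not the plugin's code: fakeClient, snapshot, errNotFound, checkSnapshots and the snapshot names are all hypothetical stand-ins, errors.Is(err, errNotFound) takes the place of apierrors.IsNotFound(err), and started/finished are modeled as simple counts, which is an assumption about their meaning. The fake client lists three snapshots, but one of them has already been deleted by the time Get runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes 404. Real code would call
// apierrors.IsNotFound(err) rather than errors.Is(err, errNotFound).
var errNotFound = errors.New("not found")

// snapshot is a hypothetical, trimmed-down VolumeSnapshot.
type snapshot struct {
	Name       string
	ReadyToUse bool
}

// fakeClient is a hypothetical in-memory stand-in for the API client.
// listed is what List returned earlier; live is what Get can still find.
type fakeClient struct {
	listed []string
	live   map[string]snapshot
}

func (c *fakeClient) List() []string { return c.listed }

func (c *fakeClient) Get(name string) (snapshot, error) {
	s, ok := c.live[name]
	if !ok {
		return snapshot{}, fmt.Errorf("volumesnapshots %q: %w", name, errNotFound)
	}
	return s, nil
}

// checkSnapshots fetches every listed snapshot. With skipNotFound set, a
// snapshot deleted after List is skipped; otherwise the first NotFound
// aborts the loop, which mirrors the behavior described in this report.
func checkSnapshots(c *fakeClient, skipNotFound bool) (started, finished int, err error) {
	for _, name := range c.List() {
		s, getErr := c.Get(name)
		if getErr != nil {
			if skipNotFound && errors.Is(getErr, errNotFound) {
				continue // deleted between List and Get; nothing left to check
			}
			return started, finished, fmt.Errorf("failed to get volumeSnapshot: %w", getErr)
		}
		started++
		if s.ReadyToUse {
			finished++
		}
	}
	return started, finished, nil
}

func main() {
	// "snap-b" was listed but has since been cleaned up.
	c := &fakeClient{
		listed: []string{"snap-a", "snap-b", "snap-c"},
		live: map[string]snapshot{
			"snap-a": {Name: "snap-a", ReadyToUse: true},
			"snap-c": {Name: "snap-c", ReadyToUse: true},
		},
	}
	for _, skip := range []bool{false, true} {
		started, finished, err := checkSnapshots(c, skip)
		fmt.Printf("skipNotFound=%v started=%d finished=%d err=%v\n", skip, started, finished, err)
	}
}
```

With skipNotFound=false the loop returns on the deleted snapshot and never reaches the one after it; with skipNotFound=true both remaining snapshots are processed. Any error other than NotFound is still returned, so real API failures are not swallowed.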