Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.20
Component/s: HyperShift
Labels:
None

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.22.0
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

I suspect there is a race condition in the hypershift-oadp-plugin, in CheckVolumeSnapshot() at pkg/common/utils.go:274-330. The function iterates over the VolumeSnapshots returned by a List call (line 286) and issues a Get for each one (line 292). If a snapshot is deleted between the List and its Get, the resulting 404 NotFound error is treated as fatal. This aborts processing of the remaining snapshots, so the backup is marked PartiallyFailed and is missing etcd data.

Version-Release number of selected component (if applicable):

velero server: quay.io/konveyor/velero@sha256:a23111a98c9a7e99ce7a36e2a9288614dcfe13cd5ec9ac2bcf52c849772b160a
azurePlugin: quay.io/konveyor/velero-plugin-for-microsoft-azure@sha256:b2db5f09da514e817a74c992dcca5f90b77c2ab0b2797eba947d224271d6070e
hypershiftPlugin: quay.io/redhat-user-workloads/ocp-art-tenant/oadp-hypershift-oadp-plugin-main@sha256:51df9e40bfa8cf943d6723913e5408c6e435b4d4419c90da001310197c4c0cce

How reproducible:

Intermittent - observed in 25/100 scheduled backups. Depends on timing of snapshot cleanup vs plugin processing.

Steps to Reproduce:

  1. Schedule recurring Velero backups of a HyperShift cluster with multiple etcd PVCs
  2. Wait for backups to run with DataMover enabled
  3. Observe PartiallyFailed backups when snapshot cleanup races with plugin's List/Get sequence

Actual results:

Backup fails with error from WaitForVolumeSnapshot() (pkg/common/utils.go:361):
  giving up, VolumeSnapshot was not finished in the expected timeout.
  Err: failed to get volumeSnapshot: volumesnapshots.snapshot.storage.k8s.io "velero-data-etcd-X-XXXXX" not found

One or more etcd PVC snapshots are missing from backup storage.

Expected results:

Deleted snapshots (already completed and cleaned up) should be skipped gracefully, allowing subsequent snapshots to be processed. Backup should complete successfully.

Additional info:

ARO-HCP uses PremiumV2 disks for etcd. PremiumV2 disks do not support instant access snapshots, so each disk takes 8-10 minutes to snapshot. Because of that, you may not hit the issue if you use PremiumV2 disks for etcd. To reproduce it, you can use Premium disks, which do have instant access snapshots, or I can provide a forked version of the AzureDisk-CSI-Driver that enables instant access snapshots for PremiumV2 disks.

I vibed up a fix locally and saw success with backups since deploying it.

Add apierrors.IsNotFound(err) handling in CheckVolumeSnapshot() at lines 292-294, matching the existing pattern in CheckVolumeSnapshotContent() at lines 194-198:

  if err := c.Get(ctx, types.NamespacedName{...}, object); err != nil {
      if !apierrors.IsNotFound(err) {
          return started, finished, fmt.Errorf("failed to get volumeSnapshot: %w", err)
      }
      continue  // Skip deleted snapshots gracefully
  }

links to

openshift/hypershift-oadp-plugin#182: OCPBUGS-75913: Fix VolumeSnapshot race condition in backup processing

Assignee: Juan Manuel Parrilla Madrid

Reporter: Tony Schneider

QA Contact: Jie Zhao

Need Info From: None

Votes: 0

Watchers: 4

Created: 2026/02/04 6:43 PM

Updated: 2026/02/07 8:39 AM
