Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: OADP 1.6.0
Affects Version/s: OADP 1.5.0
Component/s: oadp-operator
Labels:

Activity Type: Quality / Stability / Reliability
Story Points: 3
Blocked: False
Blocked Reason: None
Ready: False
QEStatus: ToDo
Intelligence Requested:
Market:
Risk Probability: Very Likely
Risk Score: 0
Workstream: None
Root Cause: Unset
Failure Category: Unknown
Regression: None
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:
PX Impact Score:

Description of problem:

In OADP 1.5 (Velero 1.16), when any node-agent pod restarts, all accepted DataUploads across all node-agents are canceled, even if they are being handled by different node-agent pods. This causes backup operations to fail and can block the entire backup queue for extended periods.
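
The difference between the reported behavior and a node-scoped alternative can be pictured with a toy model. This is only an illustrative sketch: the `DataUpload` class, the `accepted_by` field, and `cancel_on_restart` are hypothetical names invented here, and the code is neither Velero's controller logic nor the upstream fix.

```python
from dataclasses import dataclass


@dataclass
class DataUpload:
    # Hypothetical stand-in for a DataUpload resource, not the Velero type.
    name: str
    phase: str        # e.g. "Accepted", "InProgress", "Completed"
    accepted_by: str  # node whose node-agent accepted this DataUpload
    cancel: bool = False


def cancel_on_restart(uploads, restarted_node, scope_to_node):
    """Mark Accepted uploads for cancel when a node-agent starts.

    scope_to_node=False models the behavior described in this report:
    every Accepted upload is canceled, whichever node accepted it.
    scope_to_node=True models an assumed alternative in which only the
    uploads accepted by the restarted node are canceled.
    """
    canceled = []
    for du in uploads:
        if du.phase != "Accepted":
            continue  # only Accepted uploads are affected
        if scope_to_node and du.accepted_by != restarted_node:
            continue  # leave uploads owned by other nodes alone
        du.cancel = True
        canceled.append(du.name)
    return canceled
```

With two Accepted uploads on different nodes, the unscoped variant cancels both when one node restarts, while the scoped variant cancels only the upload that the restarted node had accepted.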

How reproducible:

Whenever a node-agent pod restarts, the accepted DataUploads are canceled.

Steps to Reproduce:
1. Start a backup
2. Kill a node-agent pod
3. Monitor the running backup and its DataUploads
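
The monitoring step can be scripted. The sketch below is one possible approach, not part of the original report: it assumes the `oc` CLI is logged in to the cluster and that OADP runs in the `openshift-adp` namespace (adjust if yours differs). The function names are made up for illustration.

```python
import json
import subprocess
from collections import Counter


def summarize_phases(datauploads_json: str) -> Counter:
    """Count DataUploads per status.phase from `oc get ... -o json` output."""
    items = json.loads(datauploads_json).get("items", [])
    # A DataUpload without a phase yet is counted as "Unknown".
    return Counter(
        item.get("status", {}).get("phase") or "Unknown" for item in items
    )


def fetch_datauploads(namespace: str = "openshift-adp") -> str:
    """Return the JSON list of DataUploads (namespace is an assumption)."""
    result = subprocess.run(
        ["oc", "get", "datauploads.velero.io", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Usage, run before and after killing the node-agent pod:
#   print(dict(summarize_phases(fetch_datauploads())))
```

A jump in the `Canceled` count right after the pod restart, including DataUploads accepted by other nodes, would match the behavior described in this report.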

Actual results:

Test Environment:
- 32 ROSA HCP clusters
- OADP 1.5 on Management Cluster
- Daily scheduled backups (3 etcd volumes per cluster = 96 DataUploads total)

Details:

- Stuck Backup: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022

- Failed DataUpload: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
- Controlling Node-Agent: node-agent-nmvn2 on node ip-10-0-170-239.ec2.internal
- Timeline:
  - 01:19:27Z - DataUpload accepted by ip-10-0-170-239.ec2.internal
  - 01:19:27Z - Started exposing CSI snapshot
  - 01:20:01Z - Canceled (34 seconds after accept)
  - Queue blocked until 14:00Z (~13 hours)
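
The intervals in the timeline can be checked with a few lines of Python. This is a small sketch that assumes all three events fall on 2025-10-23 (the date in the node-agent logs) and that "14:00Z" means 14:00:00Z; the helper name `parse_utc` is made up.

```python
from datetime import datetime, timezone


def parse_utc(date: str, clock: str) -> datetime:
    """Combine a date and an HH:MM:SSZ clock time into a UTC datetime."""
    parsed = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


accepted = parse_utc("2025-10-23", "01:19:27Z")
canceled = parse_utc("2025-10-23", "01:20:01Z")
unblocked = parse_utc("2025-10-23", "14:00:00Z")  # seconds assumed to be 00

# 34.0 seconds between accept and cancel, as stated in the timeline.
seconds_to_cancel = (canceled - accepted).total_seconds()

# Roughly 12.7 hours between the cancel and the queue unblocking,
# which the report rounds to ~13 hours.
hours_blocked = (unblocked - canceled).total_seconds() / 3600
```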

Cancellation Message:
"Dataupload is in Accepted status during the node-agent starting, mark it as cancel"

Error:
volumesnapshots.snapshot.storage.k8s.io "velero-data-etcd-0-5ffgn" not found

Node-Agent Logs (node-agent-nmvn2):

time="2025-10-23T01:19:27Z" level=info msg="This datauplod has been accepted by ip-10-0-170-239.ec2.internal" Dataupload=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
time="2025-10-23T01:19:27Z" level=info msg="Exposing CSI snapshot" owner=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
time="2025-10-23T01:20:01Z" level=warning msg="expose snapshot with err error wait volume snapshot ready: error to get VolumeSnapshot ocm-int-2m3dggopfhtebvfa1rqhstejg3i7sou8-k5l6b6h1f8v0l9a/velero-data-etcd-0-5ffgn: volumesnapshots.snapshot.storage.k8s.io \"velero-data-etcd-0-5ffgn\" not found but it may caused by clean up resources in cancel action"
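
When sifting through many node-agent log lines like the ones above, it can help to split each line into its key=value fields. The following is a minimal sketch of such a parser, not a tool shipped with Velero or OADP; `parse_logfmt` is a hypothetical helper and it handles only the quoting seen in these logs.

```python
import re

# key=value pairs where the value is either a double-quoted string
# (with backslash escapes) or a run of non-space characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Split one key=value log line into a dict of fields."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if raw.startswith('"'):
            # Strip the surrounding quotes and unescape inner quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields
```

Filtering the parsed lines on the `Dataupload` or `owner` field then gives the per-DataUpload sequence of events, such as the accept at 01:19:27Z and the warning at 01:20:01Z.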

There is a known Velero issue fixed in 1.17:
- Upstream Issue: https://github.com/vmware-tanzu/velero/issues/8534
- Upstream Fix: https://github.com/vmware-tanzu/velero/pull/8952/files
- Status: Fixed in Velero 1.17, but requires significant controller refactoring

Request:

Backport the Velero 1.17 fix to OADP 1.5.z for rosa-86.

Assignee: Wes Hayutin

Reporter: Eric Cambel

Votes: 0

Watchers: 3

Created: 2025/10/23 8:01 PM

Updated: 2025/11/05 8:39 PM
