Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version: OADP 1.5.0
Impact: Quality / Stability / Reliability
Status: ToDo
Description of problem:
In OADP 1.5 (Velero 1.16), when any node-agent pod restarts, all accepted DataUploads across all node-agents are canceled, even if they are being handled by different node-agent pods. This causes backup operations to fail and can block the entire backup queue for extended periods.
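For triage it helps to confirm that the canceled DataUploads were accepted by different node-agents than the one that restarted. A minimal sketch using the Python kubernetes client, assuming the OADP namespace is "openshift-adp" and that DataUpload status exposes "phase" and "node" fields (both assumptions; adjust for your install):

# List DataUploads grouped by the node handling them, with their phase.
# Assumptions: namespace "openshift-adp"; DataUpload CRD velero.io/v2alpha1
# with status.phase and status.node populated.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

uploads = api.list_namespaced_custom_object(
    group="velero.io", version="v2alpha1",
    namespace="openshift-adp", plural="datauploads",
)

by_node = defaultdict(list)
for du in uploads.get("items", []):
    status = du.get("status", {})
    by_node[status.get("node", "<unassigned>")].append(
        (du["metadata"]["name"], status.get("phase", "New")),
    )

for node, entries in sorted(by_node.items()):
    print(node)
    for name, phase in entries:
        print(f"  {name}: {phase}")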
How reproducible:
Every time: whenever a node-agent pod restarts, accepted DataUploads are canceled.
Steps to Reproduce:
1. Start a backup
2. Kill a node-agent pod
3. Monitor the in-progress backup and its DataUploads (see the sketch below)
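A minimal reproduction sketch for steps 2-3, assuming the node-agent pods carry the label "name=node-agent" and live in the "openshift-adp" namespace (both assumptions; adjust for your install):

# Step 2: delete one node-agent pod (the DaemonSet recreates it).
# Step 3: watch DataUpload phase changes; with this bug, Accepted uploads
# handled by *other* nodes flip to Canceling/Canceled once the replacement
# pod starts.
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()
crd = client.CustomObjectsApi()
NS = "openshift-adp"

# Delete the first node-agent pod found.
pod = core.list_namespaced_pod(NS, label_selector="name=node-agent").items[0]
core.delete_namespaced_pod(pod.metadata.name, NS)
print(f"deleted {pod.metadata.name}")

# Stream DataUpload events and print name, phase, and handling node.
w = watch.Watch()
for event in w.stream(crd.list_namespaced_custom_object,
                      group="velero.io", version="v2alpha1",
                      namespace=NS, plural="datauploads",
                      timeout_seconds=300):
    du = event["object"]
    status = du.get("status", {})
    print(event["type"], du["metadata"]["name"],
          status.get("phase"), status.get("node"))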
Actual results:
Test Environment:
- 32 ROSA HCP clusters
- OADP 1.5 on Management Cluster
- Daily scheduled backups (3 etcd volumes per cluster = 96 DataUploads total)
Details:
- Stuck Backup: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022
- Failed DataUpload: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
- Controlling Node-Agent: node-agent-nmvn2 on node ip-10-0-170-239.ec2.internal
- Timeline:
- 01:19:27Z - DataUpload accepted by ip-10-0-170-239.ec2.internal
- 01:19:27Z - Started exposing CSI snapshot
- 01:20:01Z - Canceled (34 seconds after accept)
- Blocked queue until 14:00Z (~13 hours)
Cancellation Message:
"Dataupload is in Accepted status during the node-agent starting, mark it as cancel"
Error:
volumesnapshots.snapshot.storage.k8s.io "velero-data-etcd-0-5ffgn" not found
Node-Agent Logs (node-agent-nmvn2):
time="2025-10-23T01:19:27Z" level=info msg="This datauplod has been accepted by ip-10-0-170-239.ec2.internal" Dataupload=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
time="2025-10-23T01:19:27Z" level=info msg="Exposing CSI snapshot" owner=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
time="2025-10-23T01:20:01Z" level=warning msg="expose snapshot with err error wait volume snapshot ready: error to get VolumeSnapshot ocm-int-2m3dggopfhtebvfa1rqhstejg3i7sou8-k5l6b6h1f8v0l9a/velero-data-etcd-0-5ffgn: volumesnapshots.snapshot.storage.k8s.io \"velero-data-etcd-0-5ffgn\" not found but it may caused by clean up resources in cancel action"
This is a known Velero issue, fixed upstream in Velero 1.17:
- Upstream Issue: https://github.com/vmware-tanzu/velero/issues/8534
- Upstream Fix: https://github.com/vmware-tanzu/velero/pull/8952/files
- Status: Fixed in Velero 1.17, but requires significant controller refactoring
Request:
Backport the Velero 1.17 fix to OADP 1.5.z for the rosa-86