  1. OpenShift API for Data Protection
  2. OADP-6896

Node agent pod restarts cancel all DataUploads across all nodes, blocking backup queue in OADP 1.5


    • Quality / Stability / Reliability
    • 3
    • False
    • False
    • ToDo
    • Very Likely
    • 0
    • None
    • Unset
    • Unknown
    • None

      Description of problem:

      In OADP 1.5 (Velero 1.16), when any node-agent pod restarts, all accepted DataUploads across all node-agents are canceled, even if they are being handled by different node-agent pods. This causes backup operations to fail and can block the entire backup queue for extended periods.
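The difference between the reported behaviour and a node-scoped cancellation can be sketched with a small model. This is only an illustration of the description above, not Velero's controller code and not the upstream fix; the `DataUpload` class, the `accepted_by` field, and both function names are hypothetical.

```python
from dataclasses import dataclass, replace

ACCEPTED = "Accepted"
CANCELED = "Canceled"


@dataclass(frozen=True)
class DataUpload:
    name: str
    phase: str
    accepted_by: str  # hypothetical field: node whose node-agent accepted it


def cancel_on_startup_observed(uploads, restarted_node):
    """Behaviour reported in this issue: the restarting node-agent cancels
    every Accepted DataUpload, regardless of which node accepted it."""
    return [
        replace(u, phase=CANCELED) if u.phase == ACCEPTED else u
        for u in uploads
    ]


def cancel_on_startup_scoped(uploads, restarted_node):
    """Hypothetical alternative: only cancel Accepted DataUploads that were
    accepted by the node whose node-agent restarted."""
    return [
        replace(u, phase=CANCELED)
        if u.phase == ACCEPTED and u.accepted_by == restarted_node
        else u
        for u in uploads
    ]


if __name__ == "__main__":
    uploads = [
        DataUpload("du-1", ACCEPTED, "node-a"),
        DataUpload("du-2", ACCEPTED, "node-b"),
    ]
    # node-b's agent restarts; in the observed behaviour du-1 is lost too.
    print([u.phase for u in cancel_on_startup_observed(uploads, "node-b")])
    print([u.phase for u in cancel_on_startup_scoped(uploads, "node-b")])
```

In the model, restarting the agent on `node-b` cancels `du-1` as well under the observed behaviour, which matches the cross-node impact described in this report.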

      How reproducible:

      Whenever a node-agent pod restarts, DataUploads in the Accepted phase are canceled.

      Steps to Reproduce:
      1. Start a backup
      2. Kill a node-agent pod
      3. Monitor the running backup and its DataUploads
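For the monitoring step, a helper along these lines could summarize DataUpload phases for one backup. It is a sketch under assumptions: the `openshift-adp` namespace is assumed (adjust to the actual install namespace), DataUploads are matched by the backup-name prefix seen in this report, and the function names are hypothetical.

```python
import json
import subprocess
from collections import Counter

NAMESPACE = "openshift-adp"  # assumption: namespace where OADP is installed


def fetch_datauploads(namespace=NAMESPACE):
    """Return the DataUpload objects in the namespace as parsed JSON.
    Requires a logged-in `oc` client on PATH."""
    result = subprocess.run(
        ["oc", "get", "datauploads.velero.io", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["items"]


def phases_for_backup(items, backup_name):
    """Count status.phase values for DataUploads whose name starts with
    '<backup_name>-', the naming pattern shown in this report."""
    counts = Counter()
    for item in items:
        if item["metadata"]["name"].startswith(backup_name + "-"):
            phase = item.get("status", {}).get("phase") or "<none>"
            counts[phase] += 1
    return counts


if __name__ == "__main__":
    backup = "2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022"
    print(dict(phases_for_backup(fetch_datauploads(), backup)))
```

Running it before and after killing the node-agent pod should show DataUploads moving from `Accepted` to `Canceled` if the issue reproduces.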

      Actual results:

      Test Environment:
        - 32 ROSA HCP clusters
        - OADP 1.5 on Management Cluster
        - Daily scheduled backups (3 etcd volumes per cluster = 96 DataUploads total)

      Details:

        - Stuck Backup: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022

        - Failed DataUpload: 2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
        - Controlling Node-Agent: node-agent-nmvn2 on node ip-10-0-170-239.ec2.internal
        - Timeline:
          - 01:19:27Z - DataUpload accepted by ip-10-0-170-239.ec2.internal
          - 01:19:27Z - Started exposing CSI snapshot
          - 01:20:01Z - Canceled (34 seconds after accept)
          - Blocked queue until 14:00Z (~13 hours)
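The intervals in the timeline can be checked with a few lines of arithmetic. The sketch assumes all timestamps fall on 2025-10-23 UTC, the date shown in the node-agent logs, and that the queue unblocked at exactly 14:00:00Z; the helper name `at` is hypothetical.

```python
from datetime import datetime, timezone


def at(hms):
    """Parse an HH:MM:SS time as UTC on 2025-10-23 (assumed date)."""
    parsed = datetime.strptime(f"2025-10-23T{hms}Z", "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


accepted = at("01:19:27")
canceled = at("01:20:01")
unblocked = at("14:00:00")  # assumption: "14:00Z" taken as 14:00:00Z

# Seconds between accept and cancel.
accept_to_cancel_s = (canceled - accepted).total_seconds()
# Hours the queue stayed blocked after the cancel.
blocked_hours = (unblocked - canceled).total_seconds() / 3600

if __name__ == "__main__":
    print(accept_to_cancel_s)       # 34.0
    print(round(blocked_hours, 1))  # 12.7
```

The result agrees with the 34-second figure above and with the roughly 13 hours of blocked queue.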

      Cancellation Message:
        "Dataupload is in Accepted status during the node-agent starting, mark it as cancel"

      Error:
        volumesnapshots.snapshot.storage.k8s.io "velero-data-etcd-0-5ffgn" not found

        Node-Agent Logs (node-agent-nmvn2):

        time="2025-10-23T01:19:27Z" level=info msg="This datauplod has been accepted by ip-10-0-170-239.ec2.internal" Dataupload=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
        time="2025-10-23T01:19:27Z" level=info msg="Exposing CSI snapshot" owner=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr
        time="2025-10-23T01:20:01Z" level=warning msg="expose snapshot with err error wait volume snapshot ready: error to get VolumeSnapshot ocm-int-2m3dggopfhtebvfa1rqhstejg3i7sou8-k5l6b6h1f8v0l9a/velero-data-etcd-0-5ffgn: volumesnapshots.snapshot.storage.k8s.io \"velero-data-etcd-0-5ffgn\" not found but it may caused by clean up resources in cancel action"
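The node-agent log lines above use a `key=value` layout with quoted values, so they can be split into fields for filtering by DataUpload name or level. The following is a minimal sketch using the standard library's `shlex`; the function name `parse_logfmt` is hypothetical, and a message containing an unbalanced quote or apostrophe would make `shlex.split` raise `ValueError`.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honouring double-quoted
    values and backslash-escaped quotes inside them."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


if __name__ == "__main__":
    line = (
        'time="2025-10-23T01:19:27Z" level=info '
        'msg="This datauplod has been accepted by ip-10-0-170-239.ec2.internal" '
        "Dataupload=2m3dggopfhtebvfa1rqhstejg3i7sou8-daily-20251023003022-wx2fr"
    )
    fields = parse_logfmt(line)
    print(fields["time"], fields["level"], fields["Dataupload"])
```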

       

        There is a known Velero issue fixed in 1.17:
        - Upstream Issue: https://github.com/vmware-tanzu/velero/issues/8534
        - Upstream Fix: https://github.com/vmware-tanzu/velero/pull/8952/files
        - Status: Fixed in Velero 1.17, but requires significant controller refactoring

       

        Request:

        Backport the Velero 1.17 fix to OADP 1.5.z for the rosa-86

              wnstb Wes Hayutin
              ecambel.openshift Eric Cambel