OpenShift API for Data Protection / OADP-7025

FS restore partially fails when loadAffinity set on NodeAgent pod


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Labels: kopia, restore
    • Quality / Stability / Reliability

      Description of problem:

      Noticed an issue while testing the nodeAgent loadAffinity setting: restore partially fails with an error that the node-agent pod is not found on the specified node. Error attached below:

        Velero:   node-agent pod is not running in node oadp-137771-bpr9v-worker-b-jhs7p: daemonset pod not found in running state in node oadp-137771-bpr9v-worker-b-jhs7p

       

      Version-Release number of selected component (if applicable):

      Deployed OADP via the make deploy command.

      How reproducible:
      Always

       

      Steps to Reproduce:
      1. Added a label (foo=bar) to one of the worker nodes:

       

      $ oc get nodes -l foo=bar
      NAME                               STATUS   ROLES    AGE     VERSION
      oadp-137771-bpr9v-worker-a-sgxkh   Ready    worker   5h57m   v1.33.5 
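The label in step 1 was presumably applied with a command along these lines (node name taken from the output above; the exact invocation is not in the original report):

```shell
# Hypothetical reconstruction of step 1: label one worker node so that the
# nodeAgent loadAffinity selector (key "foo", operator In, value "bar") matches it.
oc label node oadp-137771-bpr9v-worker-a-sgxkh foo=bar
```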

       

      2. Created a DPA with the nodeAgent loadAffinity spec:

       

      apiVersion: oadp.openshift.io/v1alpha1
      kind: DataProtectionApplication
      metadata:
        creationTimestamp: "2025-11-24T11:32:25Z"
        generation: 1
        name: ts-dpa
        namespace: openshift-adp
        resourceVersion: "154981"
        uid: c3d5573f-f237-4eaa-b0d4-c127ff06bafa
      spec:
        backupLocations:
        - velero:
            credential:
              key: cloud
              name: cloud-credentials-gcp
            default: true
            objectStorage:
              bucket: oadp137771bpr9v
              prefix: velero-e2e-3ea14aaa-c929-11f0-a5be-5ea249a46217
            provider: gcp
        configuration:
          nodeAgent:
            enable: true
            loadAffinity:
            - nodeSelector:
                matchExpressions:
                - key: foo
                  operator: In
                  values:
                  - bar
            restorePVC:
              ignoreDelayBinding: true
            uploaderType: kopia
          velero:
            defaultPlugins:
            - openshift
            - gcp
            - kubevirt
            - hypershift
            disableFsBackup: false
        logFormat: text
        podDnsConfig: {}
        snapshotLocations: []
      status:
        conditions:
        - lastTransitionTime: "2025-11-24T11:32:25Z"
          message: Reconcile complete
          reason: Complete
          status: "True"
          type: Reconciled
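With this loadAffinity in place, the node-agent DaemonSet pods should only be scheduled on nodes matching foo=bar. One way to verify (assuming the node-agent pods carry the usual name=node-agent label; not shown in the original report):

```shell
# List where node-agent pods actually run, and which nodes the selector matches.
oc get pods -n openshift-adp -l name=node-agent -o wide
oc get nodes -l foo=bar
```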

      3. Deployed an application on the same labeled node:

      $ oc get pod -n test-oadp-683 -o wide
      NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE                               NOMINATED NODE   READINESS GATES
      mysql-67fd7fdff6-9vnbd   1/1     Running   0          7m20s   10.128.2.74   oadp-137771-bpr9v-worker-a-sgxkh   <none>           <none> 
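The application presumably landed on the labeled node via a nodeSelector (or a similar scheduling constraint) in its Deployment; a minimal sketch, assuming the same foo=bar label is reused for the app:

```yaml
# Hypothetical pod-spec fragment pinning the app to the labeled node.
spec:
  template:
    spec:
      nodeSelector:
        foo: bar
```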

      4. Created an FSB backup:

       apiVersion: velero.io/v1
        kind: Backup
        metadata:
          creationTimestamp: "2025-11-24T11:46:40Z"
          generation: 6
          labels:
            velero.io/storage-location: ts-dpa-1
          name: mysql-3f8614d2-c929-11f0-a5be-5ea249a46217
          namespace: openshift-adp
          resourceVersion: "159333"
          uid: 2ddc263d-a492-485c-92a8-c5a75be7ec1c
        spec:
          csiSnapshotTimeout: 10m0s
          defaultVolumesToFsBackup: true
          excludedClusterScopedResources:
          - volumesnapshotcontents.snapshot.storage.k8s.io
          excludedNamespaceScopedResources:
          - volumesnapshots.snapshot.storage.k8s.io
          hooks: {}
          includedNamespaces:
          - test-oadp-683
          itemOperationTimeout: 4h0m0s
          metadata: {}
          snapshotMoveData: false
          storageLocation: ts-dpa-1
          ttl: 720h0m0s
          volumeGroupSnapshotLabelKey: velero.io/volume-group
        status:
          completionTimestamp: "2025-11-24T11:47:04Z"
          expiration: "2025-12-24T11:46:40Z"
          formatVersion: 1.1.0
          hookStatus: {}
          phase: Completed
          progress:
            itemsBackedUp: 43
            totalItems: 43
          startTimestamp: "2025-11-24T11:46:40Z"
          version: 1 
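A backup with this spec could have been created either from the YAML above or with a CLI call along these lines (names taken from the spec; the exact invocation is not in the report):

```shell
# Hypothetical CLI equivalent of the backup spec above.
velero backup create mysql-3f8614d2-c929-11f0-a5be-5ea249a46217 \
  --include-namespaces test-oadp-683 \
  --default-volumes-to-fs-backup \
  --storage-location ts-dpa-1 \
  -n openshift-adp
```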

      5. Removed the application namespace
      6. Triggered a restore
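Steps 5 and 6 correspond to commands along these lines (restore name matches the describe output that follows; the exact invocation is not in the report):

```shell
# Hypothetical reconstruction of steps 5-6.
oc delete namespace test-oadp-683
velero restore create test-restore \
  --from-backup mysql-3f8614d2-c929-11f0-a5be-5ea249a46217 \
  -n openshift-adp
```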

       

      Actual results: 

      Restore partially failed with a "node-agent pod is not running" error. The restore apparently targeted node oadp-137771-bpr9v-worker-b-jhs7p, which the loadAffinity selector excludes, so no node-agent pod was scheduled there.

       

      $ velero describe restore test-restore -n openshift-adp --details 
      Name:         test-restore
      Namespace:    openshift-adp
      Labels:       <none>
      Annotations:  <none>
      Phase:                       PartiallyFailed (run 'velero restore logs test-restore' for more information)
      Total items to be restored:  26
      Items restored:              26
      Started:    2025-11-24 17:31:45 +0530 IST
      Completed:  2025-11-24 17:31:52 +0530 IST
      Warnings:
        Velero:     <none>
        Cluster:    <none>
        Namespaces:
          test-oadp-683:  could not restore, RoleBinding:system:image-pullers already exists. Warning: the in-cluster version is different than the backed-up version
                          could not restore, ConfigMap:kube-root-ca.crt already exists. Warning: the in-cluster version is different than the backed-up version
                          could not restore, ConfigMap:openshift-service-ca.crt already exists. Warning: the in-cluster version is different than the backed-up version
      Errors:
        Velero:   node-agent pod is not running in node oadp-137771-bpr9v-worker-b-jhs7p: daemonset pod not found in running state in node oadp-137771-bpr9v-worker-b-jhs7p
        Cluster:    <none>
        Namespaces: <none>
      Backup:  mysql-3f8614d2-c929-11f0-a5be-5ea249a46217
      
       

       

      Expected results:

      Restore should complete successfully.

       

      Additional info:

      time="2025-11-24T13:37:27Z" level=error msg="Velero restore error: node-agent pod is not running in node oadp-137771-bpr9v-worker-b-jhs7p: daemonset pod not found in running state in node oadp-137771-bpr9v-worker-b-jhs7p" logSource="pkg/controller/restore_controller.go:602" restore=openshift-adp/test-restore1
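To confirm the node mismatch suggested by this log line, one could compare where the restored pod was scheduled against where node-agent pods run (a diagnostic sketch; these checks are not in the original report):

```shell
# Where did the restored workload land, and which nodes have a node-agent pod?
oc get pod -n test-oadp-683 -o wide
oc get pods -n openshift-adp -o wide | grep node-agent
```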

       

      $ oc get cm node-agent-ts-dpa -o yaml
      apiVersion: v1
      data:
        node-agent-config: '{"loadAffinity":[{"nodeSelector":{"matchExpressions":[{"key":"foo","operator":"In","values":["bar"]}]}}],"restorePVC":{"ignoreDelayBinding":true},"privilegedFsBackup":true}'
      kind: ConfigMap
      metadata:
        creationTimestamp: "2025-11-24T11:32:25Z"
        labels:
          app.kubernetes.io/component: node-agent-config
          app.kubernetes.io/instance: ts-dpa
          app.kubernetes.io/managed-by: oadp-operator
          openshift.io/oadp: "True"
        name: node-agent-ts-dpa
        namespace: openshift-adp
        ownerReferences:
        - apiVersion: oadp.openshift.io/v1alpha1
          blockOwnerDeletion: true
          controller: true
          kind: DataProtectionApplication
          name: ts-dpa
          uid: c3d5573f-f237-4eaa-b0d4-c127ff06bafa
        resourceVersion: "154955"
        uid: 2e422563-a974-410c-9884-1d2df019e229 
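As a quick offline sanity check, the node-agent-config JSON above can be parsed to confirm the selector the operator rendered (python3 is used here purely as a local JSON parser):

```shell
# Parse the node-agent-config value from the ConfigMap above and print the
# rendered loadAffinity match expression.
cfg='{"loadAffinity":[{"nodeSelector":{"matchExpressions":[{"key":"foo","operator":"In","values":["bar"]}]}}],"restorePVC":{"ignoreDelayBinding":true},"privilegedFsBackup":true}'
echo "$cfg" | python3 -c 'import json,sys; e=json.load(sys.stdin)["loadAffinity"][0]["nodeSelector"]["matchExpressions"][0]; print(e["key"], e["operator"], *e["values"])'
```

This prints the key, operator, and value of the match expression, confirming the operator propagated the DPA's loadAffinity into the node-agent ConfigMap unchanged.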

              wnstb Wes Hayutin
              rhn-support-prajoshi Prasad Joshi