Loading...

XML

Word

Printable

Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: OADP 1.5.0
Affects Version/s: OADP 1.4.1
Component/s: data-mover
Labels:
- oadp_datamover
- triaged

Story Points:
4
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
QEStatus:
ToDo
Intelligence Requested:
Market:

Cost of Delay:
0
WSJF:
0.000
Risk Probability:
Very Likely
Risk Score:
0

Workstream:

None

Root Cause:
Unset
Failure Category:
Unknown

Regression:
None

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Description of problem:

Restorer pods are getting scheduled on wrong node which causing datadownload to get stuck until the timeout is hit. Issue looks intermittent sometime it works sometime it doesn't. The issue only seen with restore. Attached details below:-

$ oc get node -l test=node -o wide
NAME                              STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
oadp-94941-29pqq-worker-c-bxkq8   Ready    worker   109m   v1.29.7+4510e9c   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 416.94.202408200132-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.29.7-5.rhaos4.16.gitb130ec5.el9

$ oc get pod -l velero.io/data-download -o wide 
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
test-restore2-7m87f   1/1     Running   0          9m59s   10.128.2.28   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-95j98   1/1     Running   0          10m     10.128.2.26   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-fmxln   1/1     Running   0          10m     10.128.2.23   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-frzdz   1/1     Running   0          10m     10.128.2.25   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-jdnrn   1/1     Running   0          10m     10.128.2.27   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-khdqf   1/1     Running   0          9m59s   10.128.2.29   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-nmwfj   1/1     Running   0          10m     10.128.2.24   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>
test-restore2-tr4ts   1/1     Running   0          9m58s   10.128.2.30   oadp-94941-29pqq-worker-b-tr5lh   <none>           <none>

Also verified the node has sufficient amount of memory and CPU.

Version-Release number of selected component (if applicable):
OADP 1.4.1-20

How reproducible:

Intermittent(I noticed sometime the test pass as usual)

Steps to Reproduce:
1. Added a label on one of the worker node

$ oc get node -l test=node -o wide
NAME                              STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
oadp-94941-29pqq-worker-c-bxkq8   Ready    worker   114m   v1.29.7+4510e9c   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 416.94.202408200132-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.29.7-5.rhaos4.16.gitb130ec5.el9

2. Created configmap with follow spec

$ cat node-agent-config.json 
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "test": "node"
                }
            }
        }
    ]
}

$ oc create cm node-agent-config -n openshift-adp-2 --from-file=node-agent-config.json

3. Created DPA with nodeAgent nodeSelector field set.

oc get dpa ts-dpa -o yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  creationTimestamp: "2024-08-27T06:35:02Z"
  generation: 2
  name: ts-dpa
  namespace: openshift-adp
  resourceVersion: "48062"
  uid: 5d8765c5-6ac0-454f-91d0-cd1709e1f044
spec:
  backupLocations:
  - velero:
      credential:
        key: cloud
        name: cloud-credentials-gcp
      default: true
      objectStorage:
        bucket: oadp9494129pqq
        prefix: velero
      provider: gcp
  configuration:
    nodeAgent:
      enable: true
      podConfig:
        nodeSelector:
          test: node
      uploaderType: kopia
    velero:
      defaultPlugins:
      - gcp
      - openshift
      - csi
status:
  conditions:
  - lastTransitionTime: "2024-08-27T06:35:23Z"
    message: Reconcile complete
    reason: Complete
    status: "True"
    type: Reconciled

4. Deployed an application with 8pvcs.

$ appm deploy ocp-8pvc-app

5. Triggered dataMover backup. Backup got completed successfully.

$ oc get backup test-backup-2 -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.29.7+4510e9c
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "29"
  creationTimestamp: "2024-08-27T07:04:14Z"
  generation: 14
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup-2
  namespace: openshift-adp
  resourceVersion: "62170"
  uid: 37dac656-dee5-4f01-a807-2891ca927d60
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - ocp-8pvc-app
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: true
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 8
  backupItemOperationsCompleted: 8
  completionTimestamp: "2024-08-27T07:07:49Z"
  expiration: "2024-09-26T07:04:14Z"
  formatVersion: 1.1.0
  hookStatus: {}
  phase: Completed
  progress:
    itemsBackedUp: 84
    totalItems: 84
  startTimestamp: "2024-08-27T07:04:14Z"
  version: 1

$ oc get dataupload
NAME                  STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
test-backup-2-ckt4p   Completed   25m       104857654    104857654     ts-dpa-1           27m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-csvz2   Completed   25m       104857654    104857654     ts-dpa-1           27m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-jbrnp   Completed   26m       104857654    104857654     ts-dpa-1           28m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-jmbk9   Completed   27m       104857654    104857654     ts-dpa-1           28m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-lzwlb   Completed   25m       104857654    104857654     ts-dpa-1           27m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-r946f   Completed   26m       104857654    104857654     ts-dpa-1           27m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-tq2lz   Completed   26m       104857654    104857654     ts-dpa-1           28m   oadp-94941-29pqq-worker-c-bxkq8
test-backup-2-wjmvv   Completed   25m       104857654    104857654     ts-dpa-1           28m   oadp-94941-29pqq-worker-c-bxkq8

6. Delete app namespace and triggered restore.

$ oc delete ns ocp-8pvc-app

7. Triggered restore with itemOperationsTimeout set to 10m otherwise the restore waits for around 4 hours.

$ oc get restore test-restore2 -o yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: test-restore2
  namespace: openshift-adp
spec:
  backupName: test-backup-2
  itemOperationTimeout: 0h10m0s

Actual results:

Restore is partially failed as the restorer pods are scheduled on the wrong node.

 oc get restore test-restore2 -o yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  creationTimestamp: "2024-08-27T09:46:19Z"
  finalizers:
  - restores.velero.io/external-resources-finalizer
  generation: 9
  name: test-restore2
  namespace: openshift-adp
  resourceVersion: "123227"
  uid: 30b581e9-f7a6-4763-ad68-a32925953400
spec:
  backupName: test-backup-2
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  - csinodes.storage.k8s.io
  - volumeattachments.storage.k8s.io
  - backuprepositories.velero.io
  itemOperationTimeout: 0h10m0s
status:
  completionTimestamp: "2024-08-27T09:56:30Z"
  errors: 8
  hookStatus: {}
  phase: PartiallyFailed
  progress:
    itemsRestored: 44
    totalItems: 44
  restoreItemOperationsAttempted: 8
  restoreItemOperationsFailed: 8
  startTimestamp: "2024-08-27T09:46:19Z"
  warnings: 6

DataDownload CR has the correct node name in velero.io/accepted-by label but the pod was scheduled on the wrong node.

$  oc get datadownload test-restore2-4s24f -o yaml
apiVersion: velero.io/v2alpha1
kind: DataDownload
metadata:
  creationTimestamp: "2024-08-27T09:46:21Z"
  generateName: test-restore2-
  generation: 4
  labels:
    velero.io/accepted-by: oadp-94941-29pqq-worker-c-jhkvw
    velero.io/async-operation-id: dd-30b581e9-f7a6-4763-ad68-a32925953400.bc96517c-e2a7-48c320af0
    velero.io/restore-name: test-restore2
    velero.io/restore-uid: 30b581e9-f7a6-4763-ad68-a32925953400
  name: test-restore2-4s24f
  namespace: openshift-adp
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Restore
    name: test-restore2
    uid: 30b581e9-f7a6-4763-ad68-a32925953400
  resourceVersion: "123293"
  uid: afd10eb7-5a9c-4ba2-8621-3c75922add63
spec:
  backupStorageLocation: ts-dpa-1
  cancel: true
  operationTimeout: 10m0s
  snapshotID: 3a11833f1e76eae14fd73de3c6a60963
  sourceNamespace: ocp-8pvc-app
  targetVolume:
    namespace: ocp-8pvc-app
    pv: ""
    pvc: volume6
status:
  completionTimestamp: "2024-08-27T09:56:30Z"
  phase: Canceled
  progress: {}
  startTimestamp: "2024-08-27T09:56:30Z"

Expected results:

Restore should complete successfully.

Additional info:

Attached datadownload yaml below.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

cpu_memory_usage_worker_node.png
2024/08/27 7:21 AM
17 kB
Prasad Joshi

is cloned by

OADP-5260 Documentation: DataMover restore partially fails when node selector spec is used

links to

https://github.com/vmware-tanzu/velero/issues/8242

mentioned on

Merge request - Merge branch 'OADP-4743' into 'master'

Assignee:: Shubham Pampattiwar

Reporter:: Prasad Joshi

QA Contact:: Prasad Joshi

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2024/08/27 7:35 AM

Updated:: 2024/11/22 11:10 AM

Details

Description

Description of problem:

Restorer pods are getting scheduled on wrong node which causing datadownload to get stuck until the timeout is hit. Issue looks intermittent sometime it works sometime it doesn't. The issue only seen with restore. Attached details below:-

Actual results:

Restore is partially failed as the restorer pods are scheduled on the wrong node.

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates