Description of problem:
Backups get stuck in the WaitingForPluginOperations phase when the node-agent pod is restarted. A DPA was created with small CPU and memory limits for the node-agent in order to force a node-agent pod restart.
Version-Release number of selected component (if applicable):
Installed OADP operator using velero-datamover branch
OCP 4.13
How reproducible:
Always
Steps to Reproduce:
1. Create a DPA, setting small CPU and memory limits for the node-agent pod. The DPA used is shown below, followed by a sketch of how to apply it.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: ts-dpa
  namespace: openshift-adp
  resourceVersion: "137751"
  uid: 337295a4-8978-4afc-81c5-6470238f9a16
spec:
  backupLocations:
  - velero:
      credential:
        key: cloud
        name: cloud-credentials-gcp
      default: true
      objectStorage:
        bucket: oadpbucket227213
        prefix: velero-e2e-be4635b3-40d7-11ee-88d9-845cf3eff33a
      provider: gcp
  configuration:
    restic:
      enable: true
      podConfig:
        resourceAllocations:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 10Mi
    velero:
      defaultPlugins:
      - openshift
      - gcp
      - kubevirt
      - csi
  podDnsConfig: {}
  snapshotLocations: []
status:
  conditions:
  - lastTransitionTime: "2023-08-22T10:36:30Z"
    message: Reconcile complete
    reason: Complete
    status: "True"
    type: Reconciled
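A minimal sketch of applying the DPA and watching for the node-agent restart (assumes the spec above is saved as ts-dpa.yaml; with a 50Mi memory limit the node-agent container will likely be OOMKilled under load, but verify the actual reason as in step 4):
$ oc apply -f ts-dpa.yaml
# watch the node-agent pods; the RESTARTS column should eventually go to 1
$ oc get pods -n openshift-adp -w | grep node-agent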
2. Deploy a stateful application and make sure it holds around 1-2 GB of data (a data-seeding sketch follows).
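A sketch for seeding roughly 1.5 GB of data into the application's volume. The pod name my-app-0 and the mount path /data are hypothetical placeholders; the attached-pvc namespace and the volume1 PVC match the objects referenced later in the DataUpload:
$ oc -n attached-pvc exec my-app-0 -- dd if=/dev/urandom of=/data/testfile bs=1M count=1536
$ oc -n attached-pvc exec my-app-0 -- df -h /data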
3. Execute a backup with the native data mover (snapshotMoveData: true). The resulting Backup and a creation sketch are shown below.
$ oc get backup test-backup2 -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.26.6+6bf3f75
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "26"
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup2
  namespace: openshift-adp
  resourceVersion: "139085"
  uid: eab99b56-a863-4a04-b96d-125b29ac88c4
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - attached-pvc
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: true
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 1
  expiration: "2023-09-21T10:55:59Z"
  formatVersion: 1.1.0
  phase: WaitingForPluginOperations
  progress:
    itemsBackedUp: 39
    totalItems: 39
  startTimestamp: "2023-08-22T10:55:59Z"
  version: 1
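One way to create such a backup is to apply a Backup CR directly; a minimal sketch using only the spec fields shown above (the test may have used a different client, e.g. the Velero CLI):
$ cat <<EOF | oc create -f -
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup2
  namespace: openshift-adp
spec:
  includedNamespaces:
  - attached-pvc
  snapshotMoveData: true
  storageLocation: ts-dpa-1
EOF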
4. Verify that one of the node-agent pods now has a restart count of 1 (pod listing below; a restart-reason check follows the listing).
$ oc get pod
NAME READY STATUS RESTARTS AGE
e9ec5d38e74536139e4d5027ce0c05a12f40d7595138a4987f0fa76035cs6jd 0/1 Completed 0 4h42m
node-agent-8mrts 1/1 Running 0 48m
node-agent-fgsdw 1/1 Running 1 (25m ago) 48m
node-agent-mdgqg 1/1 Running 0 48m
openshift-adp-controller-manager-54877df7c4-wtbtw 1/1 Running 0 4h41m
test-backup2-8vb28 1/1 Running 0 27m
ttl-sh-oadp-operator-bundle-3a42929-1h 1/1 Running 0 4h42m
velero-5d6cbbc5d-vhmpk 1/1 Running 0 32m
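To confirm why the node-agent container restarted, its last termination state can be checked; a sketch (OOMKilled is the expected reason given the 50Mi limit, but that is an assumption until verified):
$ oc -n openshift-adp get pod node-agent-fgsdw \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'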
Actual results:
The backup gets stuck in the WaitingForPluginOperations phase.
The DataUpload is also stuck in the InProgress phase; spec.cancel is set to true, but the cancellation does not appear to take effect (see the DataUpload below and the status check after it).
$ oc get dataupload test-backup2-8vb28 -o yaml
apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  creationTimestamp: "2023-08-22T10:56:04Z"
  generateName: test-backup2-
  generation: 5
  labels:
    velero.io/async-operation-id: du-eab99b56-a863-4a04-b96d-125b29ac88c4.fa094c8b-8308-488f5f389
    velero.io/backup-name: test-backup2
    velero.io/backup-uid: eab99b56-a863-4a04-b96d-125b29ac88c4
    velero.io/pvc-uid: fa094c8b-8308-488f-af41-d77c2278244f
  managedFields:
  - apiVersion: velero.io/v2alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:velero.io/async-operation-id: {}
          f:velero.io/backup-name: {}
          f:velero.io/backup-uid: {}
          f:velero.io/pvc-uid: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"eab99b56-a863-4a04-b96d-125b29ac88c4"}: {}
      f:spec:
        .: {}
        f:backupStorageLocation: {}
        f:csiSnapshot:
          .: {}
          f:snapshotClass: {}
          f:storageClass: {}
          f:volumeSnapshot: {}
        f:operationTimeout: {}
        f:snapshotType: {}
        f:sourceNamespace: {}
        f:sourcePVC: {}
      f:status:
        .: {}
        f:progress: {}
    manager: velero-plugin-for-csi
    operation: Update
    time: "2023-08-22T10:56:04Z"
  - apiVersion: velero.io/v2alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:cancel: {}
      f:status:
        f:message: {}
        f:node: {}
        f:phase: {}
        f:startTimestamp: {}
    manager: node-agent-server
    operation: Update
    time: "2023-08-22T10:59:05Z"
  name: test-backup2-8vb28
  namespace: openshift-adp
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: test-backup2
    uid: eab99b56-a863-4a04-b96d-125b29ac88c4
  resourceVersion: "140453"
  uid: 11b8f306-ba4b-4389-8213-cc18830a2159
spec:
  backupStorageLocation: ts-dpa-1
  cancel: true
  csiSnapshot:
    snapshotClass: ""
    storageClass: standard-csi
    volumeSnapshot: velero-volume1-2w8w6
  operationTimeout: 10m0s
  snapshotType: CSI
  sourceNamespace: attached-pvc
  sourcePVC: volume1
status:
  message: found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel
  node: oadp-51120-9pcz7-worker-b-ltj2x.c.openshift-qe.internal
  phase: InProgress
  progress: {}
  startTimestamp: "2023-08-22T10:56:04Z"
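A quick sketch for confirming that the cancellation request is not acted on, i.e. spec.cancel stays true while status.phase never leaves InProgress:
$ oc -n openshift-adp get dataupload test-backup2-8vb28 \
    -o jsonpath='{.spec.cancel} {.status.phase}{"\n"}'
# or watch for phase transitions
$ oc -n openshift-adp get dataupload test-backup2-8vb28 -w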
Expected results:
The backup should be marked as PartiallyFailed when the node-agent pod is restarted.
Additional info:
Node agent pod logs
$ oc logs node-agent-fgsdw -f
time="2023-08-22T10:59:04Z" level=info msg="Setting log-level to INFO"
time="2023-08-22T10:59:04Z" level=info msg="Starting Velero node-agent server konveyor-dev (-)" logSource="pkg/cmd/cli/nodeagent/server.go:103"
1.6927019443434076e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
time="2023-08-22T10:59:04Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
time="2023-08-22T10:59:04Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/nodeagent/server.go:245"
time="2023-08-22T10:59:05Z" level=warning dataupload=test-backup2-8vb28 logSource="pkg/cmd/cli/nodeagent/server.go:378"
time="2023-08-22T10:59:05Z" level=info msg="Controllers starting..." logSource="pkg/cmd/cli/nodeagent/server.go:287"
1.6927019456482856e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6927019456483958e+09 INFO Starting EventSource {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "source": "kind source: *v1.PodVolumeBackup"}
1.692701945648438e+09 INFO Starting Controller {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup"}
1.692701945648423e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownload"}
1.6927019456484365e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUpload"}
1.6927019456484556e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v2alpha1.DataDownloadList"}
1.6927019456484594e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v2alpha1.DataUploadList"}
1.6927019456484637e+09 INFO Starting EventSource {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "source": "kind source: *v1.Pod"}
1.6927019456484663e+09 INFO Starting EventSource {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "source": "kind source: *v1.Pod"}
1.6927019456484683e+09 INFO Starting Controller {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload"}
1.692701945648471e+09 INFO Starting Controller {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload"}
1.6927019456484087e+09 INFO Starting EventSource {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.PodVolumeRestore"}
1.6927019456484845e+09 INFO Starting EventSource {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "source": "kind source: *v1.Pod"}
1.6927019456484897e+09 INFO Starting Controller {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore"}
time="2023-08-22T10:59:05Z" level=info msg="Reconcile test-backup2-8vb28" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:109"
time="2023-08-22T10:59:05Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=openshift-adp/test-backup2-8vb28 logSource="pkg/controller/data_upload_controller.go:226"
time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-60e85156-40ce-11ee-af13-845cf3eff33a-vlh45
time="2023-08-22T10:59:05Z" level=info msg="PodVolumeBackup starting" backup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:104" podvolumebackup=openshift-adp/backup1-00d35092-40cf-11ee-af13-845cf3eff33a-lj6tv
1.6927019458372142e+09 INFO Starting workers {"controller": "podvolumebackup", "controllerGroup": "velero.io", "controllerKind": "PodVolumeBackup", "worker count": 1}
1.6927019458372939e+09 INFO Starting workers {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "worker count": 1}
1.6927019458374288e+09 INFO Starting workers {"controller": "podvolumerestore", "controllerGroup": "velero.io", "controllerKind": "PodVolumeRestore", "worker count": 1}
1.692701945837552e+09 INFO Starting workers {"controller": "datadownload", "controllerGroup": "velero.io", "controllerKind": "DataDownload", "worker count": 1}
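The log above is from the container started after the restart; the previous (terminated) container's log, which may show where the data mover upload was interrupted, can be retrieved with the --previous flag:
$ oc -n openshift-adp logs node-agent-fgsdw --previous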