
Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: OADP 1.3.3
Affects Version/s: OADP 1.3.1
Component/s: velero
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None

Ready:
False
Fixed in Build:
oadp-operator-bundle-container-1.3.3-12
QEStatus:
ToDo
Intelligence Requested:
Market:

WSJF:
0
Risk Probability:
Very Likely
Risk Score:
0

Workstream:

None

Root Cause:
Unset
Failure Category:
Unknown

Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

Velero was running a restore while the API server was rolling out to a new version. Velero had trouble connecting to the API server during the restore, but the restore itself eventually completed. Because the API server was still mid-rollout, the final update of the Restore CR status failed with "connection refused", and Velero gave up without retrying. Once the connection was back, Velero made no further attempt to update the status, so the Restore CR stayed in the "InProgress" phase indefinitely. Other components that rely on the backup/restore CR status to detect completion can then make incorrect decisions.
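To illustrate the missing behavior, here is a minimal Go sketch of retrying a failed status update with exponential backoff. It is not Velero's code and not the fix that shipped: `retryStatusUpdate`, `updateFunc`, and `simulateOutage` are hypothetical names, and the update callback merely stands in for the status patch against the API server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// updateFunc stands in for the Restore CR status patch (hypothetical).
type updateFunc func(ctx context.Context) error

// retryStatusUpdate calls update until it succeeds, the attempts run out,
// or ctx is cancelled. The delay doubles after every failed attempt.
func retryStatusUpdate(ctx context.Context, maxAttempts int, initialDelay time.Duration, update updateFunc) error {
	delay := initialDelay
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = update(ctx); lastErr == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return errors.Join(ctx.Err(), lastErr)
		case <-time.After(delay):
			delay *= 2
		}
	}
	return fmt.Errorf("status update failed after %d attempts: %w", maxAttempts, lastErr)
}

// simulateOutage fails the first `failures` calls with a connection error,
// then succeeds. It returns the number of attempts that were made.
func simulateOutage(failures, maxAttempts int) (int, error) {
	attempts := 0
	err := retryStatusUpdate(context.Background(), maxAttempts, time.Millisecond,
		func(ctx context.Context) error {
			attempts++
			if attempts <= failures {
				return errors.New("dial tcp 172.30.0.1:443: connect: connection refused")
			}
			return nil
		})
	return attempts, err
}

func main() {
	// Two refused connections, then the API server is reachable again.
	attempts, err := simulateOutage(2, 5)
	fmt.Println(attempts, err)
}
```

A bounded retry like this only helps if the outage is shorter than the total backoff window; an API server rollout can outlast it, which is one reason a reconcile-time check may be preferable.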

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Velero logs:

time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: could not restore, CustomResourceDefinition \"klusterlets.operator.open-cluster-management.io\" already exists. Warning: the in-cluster version is different than the backed-up version." logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet
time="2023-12-08T04:02:21Z" level=warning msg="Cluster resource restore warning: refresh discovery after restoring CRDs: Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:506" restore=openshift-adp/acm-klusterlet

################Restore completed
time="2023-12-08T04:02:21Z" level=info msg="restore completed" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:513" restore=openshift-adp/acm-klusterlet

time="2023-12-08T04:02:21Z" level=error msg="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/datamover/datamover.go:143"
time="2023-12-08T04:02:21Z" level=error msg="Error removing VSRs after partially failed restore" error="Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: connection refused" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:571"

################Failed to update restore CR status
time="2023-12-08T04:02:21Z" level=info msg="Error updating restore's final status" Restore=openshift-adp/acm-klusterlet error="Patch \"https://172.30.0.1:443/apis/velero.io/v1/namespaces/openshift-adp/restores/acm-klusterlet\": dial tcp 172.30.0.1:443: connect: connection refused" error.file="/remote-source/velero/app/pkg/controller/restore_controller.go:216" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).Reconcile" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:216"
...
################connection is back and it's stable
time="2023-12-08T04:02:58Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:152"
time="2023-12-08T04:02:58Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=openshift-adp/oadp-2 controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:137"

Restore CR stuck at "InProgress":

apiVersion: velero.io/v1
kind: Restore
metadata:
  annotations:
    lca.openshift.io/apply-wave: "1"
  creationTimestamp: "2023-12-08T04:01:59Z"
  generation: 4
  labels:
    velero.io/storage-location: default
  name: acm-klusterlet
  namespace: openshift-adp
  resourceVersion: "45514"
  uid: 392ad1c8-2a1a-4228-9eaa-d3cb28101de3
spec:
  backupName: acm-klusterlet
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  - csinodes.storage.k8s.io
  - volumeattachments.storage.k8s.io
  - backuprepositories.velero.io
  hooks: {}
  itemOperationTimeout: 1h0m0s
status:
  phase: InProgress
  progress:
    itemsRestored: 1
    totalItems: 63
  startTimestamp: "2023-12-08T04:01:59Z"

clones

OADP-3227 Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"

Closed

links to

openshift/velero#315: oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing

openshift/velero#324: oadp-1.3: OADP-4265: Reconcile To Fail: Add backup/restore trackers

RHSA-2024:133301 OpenShift API for Data Protection (OADP) 1.3.3 security and bug fix update
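The titles of the two linked pull requests ("Mark InProgress backup/restore as failed upon requeuing" and "Add backup/restore trackers") suggest a reconcile-time approach rather than a retry. The following Go sketch shows one way such logic could look; it is an assumption drawn from the titles alone, not the code in those pull requests, and `tracker`, `reconcilePhase`, and the phase constants are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

type phase string

const (
	phaseInProgress phase = "InProgress"
	phaseCompleted  phase = "Completed"
	phaseFailed     phase = "Failed"
)

// tracker records the restores this process is actively working on
// (hypothetical; keyed by "namespace/name").
type tracker struct {
	mu     sync.Mutex
	active map[string]struct{}
}

func newTracker() *tracker {
	return &tracker{active: make(map[string]struct{})}
}

func (t *tracker) add(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active[key] = struct{}{}
}

func (t *tracker) remove(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, key)
}

func (t *tracker) contains(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.active[key]
	return ok
}

// reconcilePhase decides the phase of a requeued restore. A restore that
// claims to be InProgress but is not tracked by this process cannot still
// be running, so it is moved to Failed instead of staying stuck.
func reconcilePhase(t *tracker, key string, current phase) phase {
	if current == phaseInProgress && !t.contains(key) {
		return phaseFailed
	}
	return current
}

func main() {
	tr := newTracker()
	tr.add("openshift-adp/still-running")

	// Not tracked: the final status update was lost, so mark it Failed.
	fmt.Println(reconcilePhase(tr, "openshift-adp/acm-klusterlet", phaseInProgress))
	// Tracked: leave it alone.
	fmt.Println(reconcilePhase(tr, "openshift-adp/still-running", phaseInProgress))
}
```

Under this design a restore whose final status update was lost ends up in a terminal phase, which consumers can act on, at the cost of reporting a failure even when the restore work itself had finished.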

mentioned on

Merge request - Updated 2 upstream sources

Merge request - Updated 3 upstream sources

Merge request - Updated US source to: 3dca118 oadp-1.4: OADP-3227: Reconcile to fail on restore stuck in-progress (#330)

Merge request - Updated US source to: 4b5cf07 oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing (#315)

Merge request - Updated US source to: e703a2b oadp-1.3: OADP-4265: Reconcile To Fail: Add backup/restore trackers (#324)


1.	[IBM QE-P] Verify Bug OADP-4265 - Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"	Release Pending	Sonia Garudi
2.	[IBM QE-Z] Verify Bug OADP-4265 - Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"	Release Pending	Ukthi Prasad
3.	[RedHat QE] Verify Bug OADP-4265 - Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"	Closed	Prasad Joshi
