[OADP-975] Documentation Release Note: CSI backup ,namespace with 1000 pods/pvcs failed with timeout error - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: OADP 1.1.1
Affects Version/s: OADP 1.1.0
Component/s: Documentation
Labels:
- qe-impact
- triaged

Story Points:
2
Blocked:
False
Blocked Reason:

None
Ready:
False
QEStatus:
ToDo
Release Note Text:

A performance regression introduced in upstream Velero affects very large backup operations that use the CSI plugin with more than one thousand PVCs.

A hard-coded one-minute timeout [1] is too short for operations of this size. Increasing the timeout and rebuilding the related images has been shown to work around the regression.

A bug [2] has been filed; the fix, a configurable timeout, is planned for OADP 1.2.

[1] https://github.com/openshift/velero/blob/konveyor-dev/pkg/controller/backup_controller.go --> method: recreateVolumeSnapshotContent(vsc snapshotv1api.VolumeSnapshotContent)
[2] https://issues.redhat.com/browse/OADP-821


WSJF:
0
Risk Probability:
Very Likely
Risk Score:
0

Workstream:

None

Root Cause:
Unset
Failure Category:
Unknown

Regression:
Yes

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

While running a CSI backup of a namespace with 1000 pods, the backup ends with the status "PartiallyFailed".

Error Message:
main-backup-scheduler-1000pods-every-2hrs-20220928-082124/backup-scheduler-1000pods-every-2hrs-20220928083022/backup-scheduler-1000pods-every-2hrs-20220928083022.log:time="2022-09-28T10:10:02Z" level=error msg="fail to recreate VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71: fail to retrieve VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71 info: timed out waiting for the condition" backup=openshift-adp/backup-scheduler-1000pods-every-2hrs-20220928083022 logSource="pkg/controller/backup_controller.go:985".

CSI backups of a namespace with 80, 90, or 100 pods all completed.
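
When triaging many backup logs, it can help to pick out this error mechanically. The following Go sketch parses the key=value fields of a log line and flags the timeout. The function names (`parseFields`, `isSnapshotTimeout`) are hypothetical, the field layout is inferred from the single log line quoted above, and the sample is that line without its file-path prefix.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sampleLine is the error from this report, minus the log file path prefix.
const sampleLine = `time="2022-09-28T10:10:02Z" level=error msg="fail to recreate VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71: fail to retrieve VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71 info: timed out waiting for the condition" backup=openshift-adp/backup-scheduler-1000pods-every-2hrs-20220928083022 logSource="pkg/controller/backup_controller.go:985"`

// fieldRe matches key=value pairs whose value is either a double-quoted
// string or a run of non-space characters.
var fieldRe = regexp.MustCompile(`(\w+)=("(?:[^"\\]|\\.)*"|\S+)`)

// parseFields returns the key=value pairs found in one log line.
func parseFields(line string) map[string]string {
	fields := make(map[string]string)
	for _, m := range fieldRe.FindAllStringSubmatch(line, -1) {
		value := m[2]
		// Strip surrounding quotes when the value is a quoted string.
		if unquoted, err := strconv.Unquote(value); err == nil {
			value = unquoted
		}
		fields[m[1]] = value
	}
	return fields
}

// isSnapshotTimeout reports whether the parsed line is an error-level
// entry whose message ends in the timeout text seen in this report.
func isSnapshotTimeout(fields map[string]string) bool {
	return fields["level"] == "error" &&
		strings.Contains(fields["msg"], "timed out waiting for the condition")
}

func main() {
	fields := parseFields(sampleLine)
	fmt.Println("backup:", fields["backup"])
	fmt.Println("source:", fields["logSource"])
	fmt.Println("snapshot timeout:", isSnapshotTimeout(fields))
}
```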

Version-Release number of selected component (if applicable):

OCP 4.10.26

OADP 1.1.0-74

How reproducible:

Steps to Reproduce:
1. Create a namespace with 1000 pods
2. Run a CSI backup
3. Check the backup status

Actual results:

The backup ends with the "PartiallyFailed" status.

Expected results:

The backup ends with the "Completed" status.

Additional info:

logs:
https://drive.google.com/drive/folders/1VxFgiILR_IlYfHhbZJEhlGuC8mid_-Iw?usp=sharing

Ran a few iterations with a 10-minute timeout, using a private Velero build; the backups completed.

upstream issue: https://github.com/vmware-tanzu/velero/issues/5416

clones

OADP-821 CSI backup ,namespace with 1000 pods/pvcs failed with timeout error

Closed

links to

openshift/openshift-docs#51819: OADP-778, OADP-804, and OADP-975: Release notes for OADP 1.1.1

openshift/openshift-docs#52659: Release notes for OADP 1.1.1: OADP-778, 804, 939 and 975 (Replaces PR 51819)

Assignee: Richard Hoch

Reporter: David Vaanunu

Votes: 0

Watchers: 3

Created: 2022/10/24 1:57 PM

Updated: 2025/03/30 3:26 PM

Resolved: 2022/11/27 8:16 AM
