  1. OpenShift API for Data Protection
  2. OADP-975

Documentation Release Note: CSI backup of namespace with 1000 pods/PVCs failed with timeout error


Details

      A performance regression inherited from upstream Velero affects very large backup operations that use the CSI plugin with more than one thousand PVCs.

      A hard-coded timeout [1] of one minute is not sufficient for operations of this size. Increasing the timeout and rebuilding the related images has been shown to work around the regression.

      A bug [2] has been filed, and the fix, a configurable timeout, is planned for OADP 1.2.

      [1] https://github.com/openshift/velero/blob/konveyor-dev/pkg/controller/backup_controller.go --> method: recreateVolumeSnapshotContent(vsc snapshotv1api.VolumeSnapshotContent)
      [2] https://issues.redhat.com/browse/OADP-821

    Description

      Description of problem:

      While running a CSI backup of a namespace with 1000 pods, the backup ended with the status "PartiallyFailed".

      Error Message:
      main-backup-scheduler-1000pods-every-2hrs-20220928-082124/backup-scheduler-1000pods-every-2hrs-20220928083022/backup-scheduler-1000pods-every-2hrs-20220928083022.log:time="2022-09-28T10:10:02Z" level=error msg="fail to recreate VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71: fail to retrieve VolumeSnapshotContent snapcontent-7a455d87-5e00-42ba-b54c-3b16ba91df71 info: timed out waiting for the condition" backup=openshift-adp/backup-scheduler-1000pods-every-2hrs-20220928083022 logSource="pkg/controller/backup_controller.go:985".

      CSI backups of namespaces with 80, 90, and 100 pods were also run, and all of them completed.

      Version-Release number of selected component (if applicable):

      OCP 4.10.26

      OADP 1.1.0-74 

      How reproducible:

       

      Steps to Reproduce:
      1. Create a namespace with 1000 pods
      2. Run CSI backup
      3. Check backup status

      Actual results:

      Backup failed with "PartiallyFailed" status

      Expected results:

      The backup finishes with "Completed" status

      Additional info:

      logs:
      https://drive.google.com/drive/folders/1VxFgiILR_IlYfHhbZJEhlGuC8mid_-Iw?usp=sharing

      Ran a few iterations with a 10-minute timeout, using a private Velero build; the backups completed.

      upstream issue: https://github.com/vmware-tanzu/velero/issues/5416

       

       

       


          People

            richard.hoch Richard Hoch
            dvaanunu@redhat.com David Vaanunu