OpenShift API for Data Protection

[OADP-3071] Performance issues when restoring 30k resources at the first time

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Critical
    • OADP 1.5.0
    • OADP 1.3.0
    • velero

      Description of problem:

      Following the bug: https://issues.redhat.com/browse/OADP-1167

      The first restore (without the existing-resource-policy: update flag) takes 55 min, double the OADP 1.1.0 result. This is a regression.

      The 2nd and 3rd restores, run with the existing-resource-policy: update flag, take 28 min, half the OADP 1.1.0 result.

      See https://issues.redhat.com/browse/OADP-1167?focusedId=21683376&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21683376

      Version-Release number of selected component (if applicable):

      OCP 4.12.9

      ODF 4.12.9-rhodf
      OADP 1.3.0-138

       

      How reproducible:

       

      Steps to Reproduce:
      1. Create a namespace with 33K secrets
      2. Run a backup
      3. Delete the namespace
      4. Run the first restore
      5. Run a few restores with the existing-resource-policy: update flag
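
Steps 1 and 5 can be scripted. The following is a minimal sketch, not the reporter's actual tooling: it uses only the Python standard library to build a `v1` `List` of Secrets for step 1 and a Velero `Restore` manifest with `existingResourcePolicy: update` for step 5. The namespace `perf-test`, the `secret-NNNNN` naming, the backup name `backup-33k`, the restore name, and the `openshift-adp` namespace are assumptions made for illustration.

```python
import base64
import json


def build_secret_list(namespace: str, count: int) -> dict:
    """Return a v1 List holding `count` small Opaque Secrets."""
    payload = base64.b64encode(b"dummy-value").decode("ascii")
    items = [
        {
            "apiVersion": "v1",
            "kind": "Secret",
            # Hypothetical naming scheme: secret-00000, secret-00001, ...
            "metadata": {"name": f"secret-{i:05d}", "namespace": namespace},
            "type": "Opaque",
            "data": {"key": payload},
        }
        for i in range(count)
    ]
    return {"apiVersion": "v1", "kind": "List", "items": items}


def build_restore(name: str, backup_name: str, velero_namespace: str,
                  update_existing: bool) -> dict:
    """Return a Velero Restore manifest, optionally with the update policy."""
    spec = {"backupName": backup_name}
    if update_existing:
        # Only set for the 2nd and later restores (step 5).
        spec["existingResourcePolicy"] = "update"
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        "metadata": {"name": name, "namespace": velero_namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    # All names below are placeholders, not values taken from this issue.
    with open("secrets.json", "w") as f:
        json.dump(build_secret_list("perf-test", 33_000), f)
    with open("restore-update.json", "w") as f:
        json.dump(build_restore("restore-2", "backup-33k", "openshift-adp", True), f)
```

The generated files could then be applied with `oc apply -f`; for a list this large, splitting it into several files may be more practical.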

      Actual results:

      The first restore completes OK, but takes twice as long as with OADP 1.1.0.

      Expected results:

      The first restore completes OK and takes no longer than with OADP 1.1.0.

      Additional info:


            Scott Seago added a comment -

            Just to clarify: the additional time needed to restore new resources is not a consequence of fixing existing resources. It is the result of a change added in Velero 1.11 (first shipped in OADP 1.2), before any of the existing-resource performance work was done. Velero now also restores the managed fields struct for each resource, but this cannot be done in the original `Create` call because that field is discarded. The resource must therefore be updated after creation, which doubles the number of API calls per item and roughly doubles the time required to restore each (non-PVC) resource.
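
The create-then-update flow described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not Velero's restore code: `FakeAPIServer`, `restore`, and `sample_backup` are hypothetical names, and the fake server simply drops `managedFields` on create to mirror the behavior described above.

```python
import copy


class FakeAPIServer:
    """In-memory stand-in for the API server that counts every request."""

    def __init__(self):
        self.objects = {}
        self.calls = 0

    def create(self, obj):
        self.calls += 1
        stored = copy.deepcopy(obj)
        # Assumption taken from the comment above: managedFields sent in the
        # create request is discarded.
        stored["metadata"].pop("managedFields", None)
        self.objects[stored["metadata"]["name"]] = stored

    def update_managed_fields(self, name, managed_fields):
        self.calls += 1
        self.objects[name]["metadata"]["managedFields"] = copy.deepcopy(managed_fields)


def restore(api, backed_up, restore_managed_fields):
    """Restore every item; optionally re-apply managedFields afterwards."""
    for obj in backed_up:
        api.create(obj)
        managed = obj["metadata"].get("managedFields")
        if restore_managed_fields and managed:
            # Second request per item: this is what doubles the call count.
            api.update_managed_fields(obj["metadata"]["name"], managed)
    return api.calls


def sample_backup(count):
    """Build `count` fake backed-up Secrets that carry managedFields."""
    return [
        {
            "kind": "Secret",
            "metadata": {
                "name": f"secret-{i}",
                "managedFields": [{"manager": "example", "operation": "Update"}],
            },
        }
        for i in range(count)
    ]


if __name__ == "__main__":
    backup = sample_backup(1000)
    print("calls without managedFields restore:", restore(FakeAPIServer(), backup, False))
    print("calls with managedFields restore:   ", restore(FakeAPIServer(), backup, True))
```

In this model the second mode issues exactly two requests per item, which matches the doubling described in the comment; real per-call costs will differ.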


            Wes Hayutin added a comment -

            ok.. I spoke to Scott about this bug. We are going to look at it but not in the immediate cycles.  I have to kick this out. https://redhat-internal.slack.com/archives/C0144ECKUJ0/p1704743947152359


            Wes Hayutin added a comment -

            Yes, performance can always be faster, but the slowdown in performance is a direct result of fixing restoring existing resources in https://issues.redhat.com/browse/OADP-1167


            Scott Seago added a comment -

            sseago No, the informer cache change improved performance for resources that already exist in the cluster. This is referencing the slowdown in performance for new resources.

            There is no resolution for this, although I wouldn't consider it a regression. Velero fixed a bug (managed fields weren't being set properly), but doing so requires patching the resource post-creation, which means twice as many API calls per resource on restore. Velero is doing more than before, so it takes longer.


            Wes Hayutin added a comment - - edited

            sseago this is fixed in 1.3.0 I believe, or is this also an informer cache setting issue at this point?


            Wes Hayutin added a comment -

            I believe the required code is upstream at this time. Leaving this in 1.3.2 for us to double-check.


            Scott Seago added a comment -

            This is a result of a change made in Velero 1.11/OADP 1.2 to restore managed fields. Managed fields are not set in the create call, so Velero has to patch the resource post-creation. As a result, restoring a resource that does not already exist in the cluster takes approximately twice as long as in OADP 1.1.

            The upstream issue: https://github.com/vmware-tanzu/velero/issues/5701

            The restore code responsible for this new functionality: https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/restore.go#L1743-L1762
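
As a rough cross-check of the "approximately twice as long" estimate, the figures reported in this issue can be turned into per-item numbers. This is a back-of-the-envelope sketch only. It assumes 33,000 items (from the steps to reproduce), a strictly sequential restore, and equal cost for the create and the follow-up patch; none of these assumptions were measured here.

```python
def per_item_seconds(total_minutes: float, items: int) -> float:
    """Average wall-clock seconds spent per restored item."""
    return total_minutes * 60 / items


def estimate_minutes(items: int, calls_per_item: int, seconds_per_call: float) -> float:
    """Estimated restore duration for a sequential restore."""
    return items * calls_per_item * seconds_per_call / 60


ITEMS = 33_000            # from the steps to reproduce
FIRST_RESTORE_MIN = 55    # first-restore duration reported in the description

# Assumption: create and patch cost the same, so each is half the per-item time.
per_item = per_item_seconds(FIRST_RESTORE_MIN, ITEMS)
per_call = per_item / 2

if __name__ == "__main__":
    print(f"time per item: {per_item:.3f} s")
    print(f"time per call: {per_call:.3f} s (assumed equal split)")
    print(f"one call per item:  {estimate_minutes(ITEMS, 1, per_call):.1f} min")
    print(f"two calls per item: {estimate_minutes(ITEMS, 2, per_call):.1f} min")
```

Under these assumptions a single call per item works out to about 27.5 minutes, half of the reported 55 minutes, which is in line with the description's statement that the first restore now takes double the OADP 1.1.0 time.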


            Mordechai Lehrer added a comment -

            Removing the 'regression' label and applying the 'regression' field.

              wnstb Wes Hayutin
              dvaanunu@redhat.com David Vaanunu
              David Vaanunu David Vaanunu