
Data mover VSR resources are sometimes created multiple times with multiple PVCs


      While backing up and restoring a namespace with 1 volume and data mover enabled, it was observed that a ReplicationDestination could sometimes be created more than once during the restore. This was about 50% reproducible at the time, but has not been reproducible since the latest data mover build.

      As of 8/1, developers are still reporting seeing this issue sporadically. It is not yet clear how to reproduce it. Keeping the bug open until the root problem is addressed.

            [OADP-611] Data mover VSR resources are sometimes created multiple times with multiple PVCs

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:8634


            Maya Peretz added a comment - edited

            After 148 iterations of a test that runs cleanup and then restores of multi-pvc-app (3 PVCs), I could not reproduce this problem (the first iteration failed due to something else).

            On the other hand, it was not reproducible using this test on 1.1.0 with VolSync 0.5.0 either.

            I haven't seen any extra snapshots created (444/3 = 148, which matches the number of iterations exactly):

            [mperetz@fedora ~]$ oc get volumesnapshotcontents.snapshot.storage.k8s.io  -l velero.io/restore-name --no-headers | wc -l # iteration 148
            444
            [mperetz@fedora ~]$ oc get volumesnapshotcontents.snapshot.storage.k8s.io -l '!velero.io/backup-name' | grep -c snapcontent 
            444
            [mperetz@fedora ~]$ 
             
            
            
            STEP: Validate the application after restore 11/09/22 05:17:30.538
            2022/11/09 05:17:30 ***************************************************************************************************************************************
            2022/11/09 05:17:30 Number of successful iterations: 147  
            2022/11/09 05:17:30 VSR regex map count map[ReplicationDestination.volsync.backube.*not.*found:0 replicationdestinations.volsync.backube.*already.*exists:0 secrets.*already.*exists:0]  
            2022/11/09 05:17:30 RD regex map count map[a.*replication.*method.*must.*be.*specified:0]
            STEP: Delete the appplication resources test-849 11/09/22 05:17:30.538
            
            
            

            test:

            https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/merge_requests/215

            emcmulla@redhat.com, rhn-engineering-dymurray, wnstb: since we were not able to reproduce it on 1.1.0 using VolSync 0.5.0, it would be helpful if you could provide further details for this scenario, such as: how many PVCs exactly did you use when you hit this? If you could provide the specific app used, that would be great.

            ---------------------------------------------------------------------------------------------------------------------------

            Following https://coreos.slack.com/archives/C0144ECKUJ0/p1667980749532509, moving Red Hat QE's sub-task to Verified (Release Pending).


            GitLab CEE Bot added a comment -

            Maya Peretz mentioned this issue in a merge request of app-mig / oadp-e2e-qe on branch bug_849:

            Test to verify OADP-611, OADP-1016 and OADP-849


            Maya Peretz added a comment -

            wnstb that was intentional this time, please check my last comment ^^


            Wes Hayutin added a comment -

            mperetz@redhat.com FYI, QE automation moved this from ON_QA to ASSIGNED here.


            Maya Peretz added a comment - edited

            emcmulla@redhat.com / shawnhurley / wnstb: it's kind of hard to verify this at the moment, as with the multiple-PVC application I somehow hit this other bug more often: https://issues.redhat.com/browse/OADP-928

            Anyway, I have refactored the code related to data mover and added a Cassandra app to cover a scenario with multiple PVCs: https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/blob/master/e2e/app_backup/backup_restore_datamover.go#L141

            I will move this bug to ASSIGNED. Please move it back to ON_QA once https://issues.redhat.com/browse/OADP-928 is resolved.

            Tested on build: oadp-operator-bundle-container-1.1.1-26


            Emily McMullan added a comment -

            shawnhurley I think if we add an app with multiple PVCs to the current data mover e2e test, that would suffice. However, this test is currently blocked by a VolSync bug, with a fix released near the end of October AFAIK.


            Shawn Hurley added a comment -

            emcmulla@redhat.com spampatt@redhat.com is there an e2e test that we can write that causes the failure and then validates this? Even if the test only hits it 1 out of every 2 runs, getting signal over time that this does fix the problem would be great.


            Emily McMullan added a comment -

            wnstb akarol@redhat.com checking the VSR resources for multiples may be difficult, because cleanup happens right after a VSR completes, which is before the restore completes. If this issue does happen, though, you will see multiple VolSync volumeSnapshotContents per VSR at the end of the restore.

            If everything works correctly, then there should be 2 volumeSnapshotContents per PVC (VolSync and Velero) at the end of the restore. So if there are more than that, multiple VSR resources were created during the process.
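
            A minimal sketch of that post-restore check, assuming a 3-PVC application and assuming the relevant volumeSnapshotContents carry the velero.io/restore-name label as in the oc queries earlier in this issue (the PVC count and threshold below are illustrative, not part of the original comment):

            # Hypothetical post-restore check; EXPECTED_PVCS is whatever the test app uses.
            EXPECTED_PVCS=3
            # Count the snapshot contents left over from the restore.
            COUNT=$(oc get volumesnapshotcontents.snapshot.storage.k8s.io \
                -l velero.io/restore-name --no-headers | wc -l)
            # Per the comment above, more than 2 per PVC (VolSync + Velero) suggests
            # duplicate VSR/ReplicationDestination resources were created.
            if [ "$COUNT" -gt $((2 * EXPECTED_PVCS)) ]; then
                echo "WARNING: $COUNT volumeSnapshotContents for $EXPECTED_PVCS PVCs - possible duplicate VSRs"
            fi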


            Wes Hayutin added a comment -

            akarol@redhat.com emcmulla@redhat.com so for testing, how does the following sound?

            Add another row to the table here [1] where the setup for the app has multiple PVCs? Perhaps we need an app that mounts two PVs and has two PVCs, and to execute multiple backups and restores to try to recreate it (a rough loop along these lines is sketched below).

            WDYT?

            [1] https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/blob/master/e2e/app_backup/backup_restore_datamover.go#L90
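
            A rough sketch of the repeated backup/restore loop suggested above, assuming the velero CLI is available and a multi-PVC app is already deployed in a namespace named multi-pvc-app (the namespace, backup/restore names, and iteration count are illustrative; the actual e2e coverage lives in the Go suite linked in [1]):

            # Hypothetical manual reproduction loop; adjust names and counts as needed.
            for i in $(seq 1 20); do
                # Back up the multi-PVC namespace (data mover enabled in the DPA).
                velero backup create "dm-backup-$i" --include-namespaces multi-pvc-app --wait
                # Remove the app so the restore has to recreate the PVCs from snapshots.
                oc delete namespace multi-pvc-app --wait=true
                # Restore, then inspect VSR / ReplicationDestination resources for duplicates.
                velero restore create "dm-restore-$i" --from-backup "dm-backup-$i" --wait
            done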

