
Data mover VSR resources are sometimes created multiple times with multiple PVCs


      While backing up and restoring a namespace with 1 volume and data mover enabled, it was observed that a ReplicationDestination could sometimes be created more than once during the restore. This was about 50% reproducible at the time, but has not been reproducible since the latest data mover build.

      As of 8/1, developers are still reporting seeing this issue sporadically. It is not yet clear how to reproduce it. Keeping the bug open until the root problem is addressed.

            [OADP-611] Data mover VSR resources are sometimes created multiple times with multiple PVCs

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:8634


            Maya Peretz added a comment - edited

            After 148 iterations of a test that runs cleanup and then restores of multi-pvc-app (3 PVCs), I could not reproduce this problem (the first iteration failed due to something else).

            On the other hand, it was not reproducible using this test on 1.1.0 with VolSync 0.5.0 either.

            I haven't seen any extra snapshots created (444/3 = 148, which matches the number of iterations exactly):

            [mperetz@fedora ~]$ oc get volumesnapshotcontents.snapshot.storage.k8s.io  -l velero.io/restore-name --no-headers | wc -l # iteration 148
            444
            [mperetz@fedora ~]$ oc get volumesnapshotcontents.snapshot.storage.k8s.io -l '!velero.io/backup-name' | grep -c snapcontent 
            444
            [mperetz@fedora ~]$ 
             
            
            
            STEP: Validate the application after restore 11/09/22 05:17:30.538
            2022/11/09 05:17:30 ***************************************************************************************************************************************
            2022/11/09 05:17:30 Number of successful iterations: 147  
            2022/11/09 05:17:30 VSR regex map count map[ReplicationDestination.volsync.backube.*not.*found:0 replicationdestinations.volsync.backube.*already.*exists:0 secrets.*already.*exists:0]  
            2022/11/09 05:17:30 RD regex map count map[a.*replication.*method.*must.*be.*specified:0]
            STEP: Delete the appplication resources test-849 11/09/22 05:17:30.538
            
            
            

            test:

            https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/merge_requests/215

            emcmulla@redhat.com, rhn-engineering-dymurray, wnstb: since we were not able to reproduce it on 1.1.0 using VolSync 0.5.0, it would be helpful if you could provide further details for this scenario, such as: how many PVCs exactly did you use when you hit this? If you could provide the specific app used, that would be great.

            ---------------------------------------------------------------------------------------------------------------------------

            Following https://coreos.slack.com/archives/C0144ECKUJ0/p1667980749532509, moving Red Hat QE's sub-task to Verified (Release Pending).


            GitLab CEE Bot added a comment -

            Maya Peretz mentioned this issue in a merge request of app-mig / oadp-e2e-qe on branch bug_849:

            Test to verify OADP-611, OADP-1016 and OADP-849


            Maya Peretz added a comment -

            wnstb that was intentional this time, please check my last comment ^^


            Wes Hayutin added a comment -

            mperetz@redhat.com FYI, QE automation moved this from ON_QA to ASSIGNED here.


            Maya Peretz added a comment - edited

            emcmulla@redhat.com / shawnhurley / wnstb: it's kind of hard to verify this at the moment, as with the multiple-PVC application I somehow hit this other bug more often: https://issues.redhat.com/browse/OADP-928

            Anyway, I have refactored the code related to data mover and added a Cassandra app to cover a scenario with multiple PVCs: https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/blob/master/e2e/app_backup/backup_restore_datamover.go#L141

            I will move this bug to ASSIGNED. Please move it back to ON_QA once https://issues.redhat.com/browse/OADP-928 is resolved.

            Tested on build: oadp-operator-bundle-container-1.1.1-26


            Emily McMullan added a comment -

            shawnhurley I think if we add an app with multiple PVCs to the current data mover e2e test, that would suffice. However, this test is currently blocked by a VolSync bug, with a fix released near the end of October AFAIK.


            Shawn Hurley added a comment -

            emcmulla@redhat.com spampatt@redhat.com is there an e2e test that we can write that causes the failure and then validates this? Even if the test only hits it 1 out of every 2 runs, getting signal over time that this does fix the problem would be great.


            Emily McMullan added a comment -

            wnstb akarol@redhat.com checking the VSR resources for multiples may be difficult, because cleanup happens right after a VSR completes, which is before the restore completes. If this issue does happen, though, you will see multiple VolSync volumeSnapshotContents per VSR at the end of the restore.

            If everything works correctly, then there should be 2 volumeSnapshotContents per PVC (VolSync and Velero) at the end of the restore. So if there are more than that, multiple VSR resources were created during the process.
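
            A minimal sketch of that post-restore check, assuming a 3-PVC application and assuming the relevant volumeSnapshotContents carry the velero.io/restore-name label as in the oc queries earlier in this issue (the PVC count and threshold below are illustrative, not part of the original comment):

            # Hypothetical post-restore check; EXPECTED_PVCS is whatever the test app uses.
            EXPECTED_PVCS=3
            # Count the snapshot contents left over from the restore.
            COUNT=$(oc get volumesnapshotcontents.snapshot.storage.k8s.io \
                -l velero.io/restore-name --no-headers | wc -l)
            # Per the comment above, more than 2 per PVC (VolSync + Velero) suggests
            # duplicate VSR/ReplicationDestination resources were created.
            if [ "$COUNT" -gt $((2 * EXPECTED_PVCS)) ]; then
                echo "WARNING: $COUNT volumeSnapshotContents for $EXPECTED_PVCS PVCs - possible duplicate VSRs"
            fi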


            Wes Hayutin added a comment -

            akarol@redhat.com emcmulla@redhat.com so for testing, how does the following sound?

            Add another row to the table here [1] where the setup for the app has multiple PVCs? Perhaps we need an app that mounts two PVs and has two PVCs, and to execute multiple backups and restores to try to recreate it (a rough loop along these lines is sketched below).

            WDYT?

            [1] https://gitlab.cee.redhat.com/app-mig/oadp-e2e-qe/-/blob/master/e2e/app_backup/backup_restore_datamover.go#L90
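
            A rough sketch of the repeated backup/restore loop suggested above, assuming the velero CLI is available and a multi-PVC app is already deployed in a namespace named multi-pvc-app (the namespace, backup/restore names, and iteration count are illustrative; the actual e2e coverage lives in the Go suite linked in [1]):

            # Hypothetical manual reproduction loop; adjust names and counts as needed.
            for i in $(seq 1 20); do
                # Back up the multi-PVC namespace (data mover enabled in the DPA).
                velero backup create "dm-backup-$i" --include-namespaces multi-pvc-app --wait
                # Remove the app so the restore has to recreate the PVCs from snapshots.
                oc delete namespace multi-pvc-app --wait=true
                # Restore, then inspect VSR / ReplicationDestination resources for duplicates.
                velero restore create "dm-restore-$i" --from-backup "dm-backup-$i" --wait
            done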

