Bug
Resolution: Unresolved
Critical
odf-4.15
Committed
Known Issue
RamenDR sprint 2024 #11, #12, #15, #16, #18, #19
Approved
Description of problem (please be as detailed as possible and provide log snippets):
Version of all relevant components (if applicable):
ODF 4.15.0-132.stable
OCP 4.15.0-0.nightly-2024-02-13-231030
ACM 2.9.2 GA'ed
Submariner 0.16.3
ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)
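For reference, versions like the above can be collected with standard CLI queries along these lines (namespace and deployment names are assumptions based on a default ODF/ACM install, not taken from this report):
$ oc version                                  # OCP version
$ oc get csv -n openshift-storage             # ODF operator versions
$ oc get csv -n open-cluster-management       # ACM version (on the hub)
$ subctl version                              # Submariner version
$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph version   # Ceph version via the toolbox pod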
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
*Active hub at neutral site*
1. Deploy multiple RBD- and CephFS-backed workloads of both ApplicationSet and Subscription types.
2. Fail over and relocate them in such a way that they finally end up running on the primary managed cluster (which is expected to host all the workloads and will later go through the disaster): the apps which were failed over from C1 to C2 are relocated back to C1, and the apps which were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all workload combinations are present in distinct states such as Deployed, FailedOver, and Relocated on C1, with a few workloads in the Deployed state on C2 as well.
4. Let at least one backup be taken for each of the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide output and lastGroupSyncTime, download backups from S3, etc. (see the example commands after the steps).
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure velero reports successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state.
They retained the last state that was backed up, so everything is fine up to this point.
7. Let IOs continue for several hours (20-30 hrs). Fail over the CephFS workloads running on C2 to C1 with all nodes of C2 up and running.
8. After successful failover and cleanup, wait for sync to resume, and after some time bring the primary cluster down (all nodes). Bring it back up after a few hours.
9. Check that the drpc state is still the same and that data sync for all workloads resumes as expected.
10. After a few hours, bring the master nodes of the primary cluster down and check the drpc state again.
(The older hub remains down forever and is completely unreachable.)
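The state captures mentioned in steps 4-6, 9, and 10 were roughly of this form; drpc in the output below is a shell alias for the wide DRPC listing, and the jsonpath field and backup namespace are assumptions based on the Ramen and ACM backup APIs, not taken from this report:
$ oc get drpc -A -o wide
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'
$ oc get backup,restore -n open-cluster-management-backup    # hub backup/restore status (steps 4-5)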
The CephFS app below changes its state from Relocated to Relocating without any action being taken on it.
Before==>
busybox-workloads-5 sub-cephfs-busybox5-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Relocate Relocated Completed 2024-02-17T14:56:29Z 18h39m39.024430842s True
After==>
busybox-workloads-5 sub-cephfs-busybox5-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Relocate Relocating 2024-02-18T19:16:07Z False
Since Ramen assumes that a relocate operation for this workload is in progress, it cannot be failed over because Peer Ready becomes False, while all other workloads running on the primary were successfully failed over after the master nodes of the primary cluster went down.
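Besides the PEER READY column in the listing below, this can also be checked directly from the DRPC conditions; a rough sketch (the PeerReady condition type comes from the Ramen DRPC API, an assumption here):
$ oc get drpc sub-cephfs-busybox5-placement-1-drpc -n busybox-workloads-5 -o jsonpath='{.status.conditions[?(@.type=="PeerReady")]}'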
amagrawa:hub$ drpc
NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-13 sub-rbd-busybox13-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:17Z False
busybox-workloads-14 sub-rbd-busybox14-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:25Z False
busybox-workloads-15 sub-rbd-busybox15-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:20:33Z False
busybox-workloads-16 sub-rbd-busybox16-placement-1-drpc 2d9h amagrawa-odf2 Deployed Completed 2024-02-16T10:12:51Z 660.371688ms True
busybox-workloads-5 sub-cephfs-busybox5-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Relocate Relocating 2024-02-18T19:16:07Z False
busybox-workloads-6 sub-cephfs-busybox6-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:19:49Z False
busybox-workloads-7 sub-cephfs-busybox7-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:19:59Z False
busybox-workloads-8 sub-cephfs-busybox8-placement-1-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:20:06Z False
openshift-gitops appset-cephfs-busybox1-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:33Z False
openshift-gitops appset-cephfs-busybox2-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:38Z False
openshift-gitops appset-cephfs-busybox3-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:43Z False
openshift-gitops appset-cephfs-busybox4-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver Cleaning Up 2024-02-18T19:18:48Z False
openshift-gitops appset-rbd-busybox10-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:18:52Z False
openshift-gitops appset-rbd-busybox11-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:18:58Z False
openshift-gitops appset-rbd-busybox12-placement-drpc 2d9h amagrawa-odf2 Deployed Completed 2024-02-16T10:13:47Z 571.259493ms True
openshift-gitops appset-rbd-busybox9-placement-drpc 2d9h amagrawa-prim amagrawa-odf2 Failover FailedOver WaitForReadiness 2024-02-18T19:19:17Z False
This leaves the application inaccessible.
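Inaccessibility here means the workload never comes up on the surviving cluster; a rough check against C1 (namespace taken from the output above):
$ oc get pods,pvc -n busybox-workloads-5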
Actual results: [RDR] [Hub recovery] CephFS workload changes its state from Relocated to Relocating on node failure
Expected results: Applications should retain their original state after node failure so that they can be failed over.
Additional info: