- Feature Request
- Resolution: Unresolved
- Critical
- None
- odf-4.20.z
Description of problem (please be as detailed as possible and provide log snippets):
One of the managed clusters (mc2) went down (due to a disaster/network issue) and OCP was reinstalled on mc2 to bring the cluster back up. SSL access across the clusters was configured with the new ingress cert. On the hub cluster, the old mc2 was detached and the reinstalled mc2 was imported again. The OpenShift DR Cluster operator was installed on mc2 automatically once the cluster was imported into the hub cluster. Failover of the application from mc1 to mc2 is stuck in the "FailingOver" state with the following error message:
oc describe drpc busybox-placement-1-drpc -n busybox-sample
.....
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning unknown state 16m controller_DRPlacementControl next state not known
Warning DRPCFailingOver 16m controller_DRPlacementControl Failing over the application and VRG
Warning DRPCClusterSwitchFailed 16m controller_DRPlacementControl failed to get VRG busybox-placement-1-drpc from cluster ocpm4202001 (err: getManagedClusterResource results: "requested resource not found in ManagedCluster" not found)
Warning DRPCClusterSwitchFailed 6m48s (x5 over 16m) controller_DRPlacementControl Waiting for App resources to be restored...)
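The DRPCClusterSwitchFailed event above indicates that the hub could not find the VRG for this workload on the reinstalled mc2. A few hedged checks that may help narrow this down (resource names are taken from the DRPC above; the commands assume the standard Ramen/ACM CRDs and are a diagnostic sketch, not a documented procedure):

# On the hub: confirm both managed clusters are joined and available after the re-import
oc get managedclusters

# On mc2 (reinstalled): check whether the VRG was ever restored there
oc get volumereplicationgroup -n busybox-sample

# On mc1: inspect the VRG that is still primary
oc get volumereplicationgroup busybox-placement-1-drpc -n busybox-sample -o yaml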
The VRG log on mc1 reports the following error:
2023-06-15T16:53:37.670Z ERROR controllers.VolumeReplicationGroup.vrginstance controllers/vrg_vrgobject.go:50 VRG Kube object protect error
{"VolumeReplicationGroup": "busybox-appset-sample/appset1-busybox-placement-drpc", "rid": "69471d6d-6a0e-450b-b5ef-887595f196b1", "State": "primary", "profile": "s3profile-ocpm4202001-ocs-external-storagecluster", "error": "failed to upload data of odrbucket-373521917843:busybox-appset-sample/appset1-busybox-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InvalidAccessKeyId: The AWS access key Id you provided does not exist in our records.\n\tstatus code: 403, request id: lixdr8jk-dg4r8q-1ddq, host id: lixdr8jk-dg4r8q-1ddq"}github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect
/remote-source/app/controllers/vrg_vrgobject.go:50
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
/remote-source/app/controllers/volumereplicationgroup_controller.go:918
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
/remote-source/app/controllers/volumereplicationgroup_controller.go:889
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions
/remote-source/app/controllers/volumereplicationgroup_controller.go:551
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
/remote-source/app/controllers/volumereplicationgroup_controller.go:524
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
/remote-source/app/controllers/volumereplicationgroup_controller.go:413
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
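The InvalidAccessKeyId (403) in the trace above suggests that the credentials Ramen holds for the s3profile-ocpm4202001-ocs-external-storagecluster profile no longer match the object bucket on the reinstalled mc2, which is plausible because reinstalling OCP/ODF regenerates the odrbucket credentials. A rough way to compare the two sides (the ConfigMap, namespace, and secret names below follow Ramen's defaults and are assumptions; <s3-secret-name> is a placeholder read from the s3StoreProfiles entry):

# On mc1: find the secret referenced by the failing s3 profile
oc get configmap ramen-dr-cluster-operator-config -n openshift-dr-system -o yaml

# Decode the access key Ramen is currently using for that profile
oc get secret <s3-secret-name> -n openshift-dr-system \
  -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d

# On mc2: decode the access key now issued for the odrbucket ObjectBucketClaim
oc get secret odrbucket-373521917843 -n openshift-storage \
  -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d

If the two keys differ, stale credentials on mc1 (and on the hub) would explain the upload failure.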
Version of all relevant components (if applicable):
OCP: 4.13.0
ODF on hub, mc1, mc2: 4.13.0-218
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Configure an MDR environment with 1 hub cluster and 2 managed clusters (Hub, mc1 and mc2)
2. Deploy an application and perform failover and relocate operations
3. Bring down one of the managed clusters (mc2) and reinstall OCP on mc2
4. Configure SSL access across the clusters with the new ingress cert from mc2 (see the sketch after this list)
5. On the hub cluster, detach the old mc2 and import the reinstalled mc2
6. The OpenShift DR Cluster operator is installed on mc2 automatically once the cluster is imported into the hub cluster
7. Perform failover of the application from mc1 to mc2
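A minimal sketch of the certificate exchange referenced in step 4, assuming the standard default-ingress-cert ConfigMap and the conventional user-ca-bundle trust ConfigMap (the authoritative procedure is the one in the ODF Metro-DR documentation):

# On the reinstalled mc2: extract the new ingress CA
oc get configmap default-ingress-cert -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' > mc2-ingress-ca.crt

# On the hub and mc1: add the new CA to the cluster-wide trust bundle
# (merge with any CAs already present in user-ca-bundle before applying)
oc create configmap user-ca-bundle -n openshift-config \
  --from-file=ca-bundle.crt=mc2-ingress-ca.crt --dry-run=client -o yaml | oc apply -f -
oc patch proxy cluster --type merge -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'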
Actual results:
Failover of the application is stuck in the FailingOver state
Expected results:
Failover of the application should complete successfully
Additional info:
Attaching the VRG logs of the managed clusters to Bugzilla and uploading the must-gather logs of all the clusters to Google Drive:
https://drive.google.com/file/d/1NmvKrORqcX-17Bd8YfOoLTHYwGRqbLX8/view?usp=sharing