Resolution: Not a Bug
Description of problem:
------------------------
Post hub recovery, failover succeeded after the fix applied by benamar on the cluster for BZ-2258351:
https://bugzilla.redhat.com/show_bug.cgi?id=2258351#c4
However, cleanup is still stuck after bringing the C1 cluster and the Ceph nodes back online:
$ oc get drpc --all-namespaces -o wide
NAMESPACE          NAME                                AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION        PEER READY
cephfs-sub1        cephfs-sub1-placement-1-drpc        38h   sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
cephfs-sub2        cephfs-sub2-placement-1-drpc        37h   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:35:06Z   22.043807167s   True
openshift-gitops   cephfs-appset1-placement-drpc       38h   sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
openshift-gitops   cephfs-appset2-placement-drpc       37h   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:08Z   1.043287787s    True
openshift-gitops   helloworld-appset1-placement-drpc   38h   sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
openshift-gitops   helloworld-appset2-placement-drpc   37h   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:33Z   1.048046697s    True
openshift-gitops   rbd-appset1-placement-drpc          38h   sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
openshift-gitops   rbd-appset2-placement-drpc          37h   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:41Z   15.040241914s   True
rbd-sub1           rbd-sub1-placement-1-drpc           38h   sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
rbd-sub2           rbd-sub2-placement-1-drpc           37h   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:36:06Z   1.04207343s     True
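Since PEER READY stays False while PROGRESSION sits in Cleaning Up, the DRPC status conditions are the first thing to inspect. A minimal sketch for one of the stuck DRPCs (plain oc/jsonpath usage; the output is not captured here):
$ oc get drpc cephfs-sub1-placement-1-drpc -n cephfs-sub1 \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'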
After some debugging with benamar, we found that the governance-policy-framework and klusterlet-addon-workmgr add-on pods are in CrashLoopBackOff:
$ oc get pods -A | grep open-cluster
open-cluster-management-agent-addon   application-manager-7f46b9d565-2hl9m           1/1   Running            0                26m
open-cluster-management-agent-addon   cert-policy-controller-84647cb784-fmrg4        1/1   Running            0                28m
open-cluster-management-agent-addon   cluster-proxy-proxy-agent-9cd9979df-q27tg      2/2   Running            0                26m
open-cluster-management-agent-addon   cluster-proxy-service-proxy-86dd8d4645-pjqkq   1/1   Running            0                28m
open-cluster-management-agent-addon   config-policy-controller-5d48696c85-5n6xs      2/2   Running            0                26m
open-cluster-management-agent-addon   governance-policy-framework-69f66dcd85-8nvfz   1/2   CrashLoopBackOff   7 (3m12s ago)    28m
open-cluster-management-agent-addon   iam-policy-controller-69b7788bc8-lbk8x         1/1   Running            0                28m
open-cluster-management-agent-addon   klusterlet-addon-search-cbff47756-zq64v        1/1   Running            0                28m
open-cluster-management-agent-addon   klusterlet-addon-workmgr-785c8995-z4qx4        0/1   CrashLoopBackOff   6 (2m7s ago)     26m
open-cluster-management-agent         klusterlet-5559fbb46b-vkdk4                    1/1   Running            0                26m
open-cluster-management-agent         klusterlet-agent-6485859859-4t2jm              1/1   Running            2 (3m32s ago)    28m
open-cluster-management-agent         klusterlet-agent-6485859859-x2mfs              1/1   Running            0                26m
open-cluster-management-agent         klusterlet-agent-6485859859-xdnt4              1/1   Running            0                26m
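For triage, the natural next step is to pull the previous container logs and the pod events from the crash-looping add-ons. A minimal sketch using the pod names above (standard oc commands; output not captured here):
$ oc -n open-cluster-management-agent-addon logs klusterlet-addon-workmgr-785c8995-z4qx4 --previous
$ oc -n open-cluster-management-agent-addon logs governance-policy-framework-69f66dcd85-8nvfz --all-containers --previous
$ oc -n open-cluster-management-agent-addon describe pod klusterlet-addon-workmgr-785c8995-z4qx4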
Version of all relevant components (if applicable):
---------------------------------------------------
OCP - 4.15.0-0.nightly-2024-01-10-101042
ODF - 4.15.0-113.stable
RHCS - 7.0
ACM - 2.9.1
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. If apps are stuck in the Cleaning Up state, I am afraid a subsequent relocate might be unsuccessful.
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4
Is this issue reproducible?
1/1
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
--------------------
1. On an MDR hub recovery setup, deploy subscription apps and appset apps;
ensure that a few apps are moved to the FailedOver and Relocated states
2. Ensure that a few apps are just installed, without assigning any DRPolicy to them
3. Ensure that backups are taken on both the active and the passive hub
4. Bring zone b down (Ceph nodes 0, 1, and 2, the C1 cluster, and the active hub cluster)
5. Restore the passive hub, turning it into the active hub
6. After importing the secrets of the C2 cluster, check that the DRPolicy is in the Validated state
7. Now assign DRPolicy to the apps that are already installed on the clusters and check the DRPC statuses.
The DRPCs will look like this due to BZ-2258351:
$ oc get drpc --all-namespaces -o wide
NAMESPACE          NAME                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION        PEER READY
cephfs-sub1        cephfs-sub1-placement-1-drpc        70m     sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
cephfs-sub2        cephfs-sub2-placement-1-drpc        101s    sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:35:06Z   22.043807167s   True
openshift-gitops   cephfs-appset1-placement-drpc       70m     sraghave-c1-jan    sraghave-c2-jan   Relocate                      Paused                                               True
openshift-gitops   cephfs-appset2-placement-drpc       3m      sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:08Z   1.043287787s    True
openshift-gitops   helloworld-appset1-placement-drpc   70m     sraghave-c1-jan    sraghave-c2-jan   Failover       FailedOver     Cleaning Up                                          False
openshift-gitops   helloworld-appset2-placement-drpc   2m35s   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:33Z   1.048046697s    True
openshift-gitops   rbd-appset1-placement-drpc          70m     sraghave-c1-jan    sraghave-c2-jan   Relocate                      Paused                                               True
openshift-gitops   rbd-appset2-placement-drpc          2m14s   sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:34:41Z   15.040241914s   True
rbd-sub1           rbd-sub1-placement-1-drpc           70m     sraghave-c1-jan    sraghave-c2-jan   Relocate                      Paused                                               True
rbd-sub2           rbd-sub2-placement-1-drpc           62s     sraghave-c2-jan                                     Deployed       Completed     2024-01-16T16:36:06Z   1.04207343s     True
8. Apply the fix for BZ-2258351
9. Now fence C1 and fail over the apps from C1 to C2 (see the fencing sketch after this list)
10. The failover should succeed, with apps in the Cleaning Up phase as expected, since C1 and the Ceph nodes are still down
11. Recover C1 and the Ceph nodes, unfence, and reboot
12. NOTE: the DRPCs are still in the Cleaning Up phase after more than 12 hrs
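For reference, the fencing and unfencing in steps 9 and 11 are driven through the DRCluster resources on the hub. A minimal sketch, assuming the DRCluster objects are named after the managed clusters shown above (verify the names with oc get drcluster first):
$ oc get drcluster
$ oc patch drcluster sraghave-c1-jan --type merge -p '{"spec":{"clusterFence":"Fenced"}}'
# ...after C1 and the Ceph nodes are recovered (step 11), unfence before rebooting:
$ oc patch drcluster sraghave-c1-jan --type merge -p '{"spec":{"clusterFence":"Unfenced"}}'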