Red Hat Advanced Cluster Management: ACM-9562

Post hub recovery, ACM pods are in CrashLoopBackOff and requests from the hub are not being serviced after C1 was brought back online



      Description of problem:
      ------------------------
      Post hub recovery, failover succeeded after the fix applied by benamar on the cluster for BZ
      https://bugzilla.redhat.com/show_bug.cgi?id=2258351#c4
      But cleanup is still stuck after bringing the C1 cluster and ceph nodes back online.

      $ oc get drpc --all-namespaces -o wide
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      cephfs-sub1 cephfs-sub1-placement-1-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      cephfs-sub2 cephfs-sub2-placement-1-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:35:06Z 22.043807167s True
      openshift-gitops cephfs-appset1-placement-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      openshift-gitops cephfs-appset2-placement-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:34:08Z 1.043287787s True
      openshift-gitops helloworld-appset1-placement-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      openshift-gitops helloworld-appset2-placement-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:34:33Z 1.048046697s True
      openshift-gitops rbd-appset1-placement-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      openshift-gitops rbd-appset2-placement-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:34:41Z 15.040241914s True
      rbd-sub1 rbd-sub1-placement-1-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      rbd-sub2 rbd-sub2-placement-1-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:36:06Z 1.04207343s True
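To pick out only the stuck placements from output like the above, the wide listing can be filtered on the "Cleaning Up" progression. A minimal sketch, assuming the column layout shown above; the sample rows below are copied from this report, and on a live cluster the filter would instead be fed from `oc get drpc --all-namespaces -o wide`:

```shell
# List the DRPCs stuck with PROGRESSION "Cleaning Up" as namespace/name.
# The sample is two rows from this report; on a live cluster, replace the
# printf with: oc get drpc --all-namespaces -o wide
sample='cephfs-sub1 cephfs-sub1-placement-1-drpc 38h sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
cephfs-sub2 cephfs-sub2-placement-1-drpc 37h sraghave-c2-jan Deployed Completed 2024-01-16T16:35:06Z 22.043807167s True'
printf '%s\n' "$sample" | awk '/Cleaning Up/ {print $1 "/" $2}'
# prints: cephfs-sub1/cephfs-sub1-placement-1-drpc
```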

      After some debugging with benamar:

      $ oc get pods -A | grep open-cluster
      open-cluster-management-agent-addon application-manager-7f46b9d565-2hl9m 1/1 Running 0 26m
      open-cluster-management-agent-addon cert-policy-controller-84647cb784-fmrg4 1/1 Running 0 28m
      open-cluster-management-agent-addon cluster-proxy-proxy-agent-9cd9979df-q27tg 2/2 Running 0 26m
      open-cluster-management-agent-addon cluster-proxy-service-proxy-86dd8d4645-pjqkq 1/1 Running 0 28m
      open-cluster-management-agent-addon config-policy-controller-5d48696c85-5n6xs 2/2 Running 0 26m
      open-cluster-management-agent-addon governance-policy-framework-69f66dcd85-8nvfz 1/2 CrashLoopBackOff 7 (3m12s ago) 28m
      open-cluster-management-agent-addon iam-policy-controller-69b7788bc8-lbk8x 1/1 Running 0 28m
      open-cluster-management-agent-addon klusterlet-addon-search-cbff47756-zq64v 1/1 Running 0 28m
      open-cluster-management-agent-addon klusterlet-addon-workmgr-785c8995-z4qx4 0/1 CrashLoopBackOff 6 (2m7s ago) 26m
      open-cluster-management-agent klusterlet-5559fbb46b-vkdk4 1/1 Running 0 26m
      open-cluster-management-agent klusterlet-agent-6485859859-4t2jm 1/1 Running 2 (3m32s ago) 28m
      open-cluster-management-agent klusterlet-agent-6485859859-x2mfs 1/1 Running 0 26m
      open-cluster-management-agent klusterlet-agent-6485859859-xdnt4 1/1 Running 0 26m
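To narrow a listing like the one above to just the crashing pods before pulling logs, the status column can be filtered. A minimal sketch, assuming the column layout shown above; the sample rows stand in for live `oc get pods -A` output:

```shell
# Print only the crashing pods (namespace/name) from `oc get pods -A` output.
# The sample is three rows copied from this report; on a live cluster, replace
# the printf with: oc get pods -A
sample='open-cluster-management-agent-addon governance-policy-framework-69f66dcd85-8nvfz 1/2 CrashLoopBackOff 7 (3m12s ago) 28m
open-cluster-management-agent-addon klusterlet-addon-workmgr-785c8995-z4qx4 0/1 CrashLoopBackOff 6 (2m7s ago) 26m
open-cluster-management-agent klusterlet-5559fbb46b-vkdk4 1/1 Running 0 26m'
printf '%s\n' "$sample" | awk '$4 == "CrashLoopBackOff" {print $1 "/" $2}'
```

For each pod printed, `oc logs --previous -n <namespace> <pod>` shows the log of the last crashed container, which is usually the fastest way to see why an addon is in CrashLoopBackOff.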

      Version of all relevant components (if applicable):
      ---------------------------------------------------
      OCP - 4.15.0-0.nightly-2024-01-10-101042
      ODF - 4.15.0-113.stable
      RHCS - 7.0
      ACM - 2.9.1

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      Yes. If apps are stuck in the Cleaning Up state, the subsequent relocate might be unsuccessful.

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      4

      Is this issue reproducible?
      1/1

      Can this issue reproduce from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      --------------------
      1. On an MDR hub recovery setup, deploy subscription apps and appset apps;
      ensure that a few apps are moved to the FailedOver and Relocated states
      2. Ensure that a few apps are just installed without assigning any DRPolicy to them
      3. Ensure that a backup is taken on both the active and passive hub
      4. Bring zone b down (ceph 0, 1, 2 nodes, the C1 cluster, and the active hub cluster)
      5. Restore the passive hub into the active hub
      6. After importing the secrets of the C2 cluster, check that the DRPolicy is in the Validated state
      7. Now assign the DRPolicy to apps that are already installed on the clusters and check the DRPC statuses
      DRPCs will look like this due to BZ-2258351
      $ oc get drpc --all-namespaces -o wide
      NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
      cephfs-sub1 cephfs-sub1-placement-1-drpc 70m sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      cephfs-sub2 cephfs-sub2-placement-1-drpc 101s sraghave-c2-jan Deployed Completed 2024-01-16T16:35:06Z 22.043807167s True
      openshift-gitops cephfs-appset1-placement-drpc 70m sraghave-c1-jan sraghave-c2-jan Relocate Paused True
      openshift-gitops cephfs-appset2-placement-drpc 3m sraghave-c2-jan Deployed Completed 2024-01-16T16:34:08Z 1.043287787s True
      openshift-gitops helloworld-appset1-placement-drpc 70m sraghave-c1-jan sraghave-c2-jan Failover FailedOver Cleaning Up False
      openshift-gitops helloworld-appset2-placement-drpc 2m35s sraghave-c2-jan Deployed Completed 2024-01-16T16:34:33Z 1.048046697s True
      openshift-gitops rbd-appset1-placement-drpc 70m sraghave-c1-jan sraghave-c2-jan Relocate Paused True
      openshift-gitops rbd-appset2-placement-drpc 2m14s sraghave-c2-jan Deployed Completed 2024-01-16T16:34:41Z 15.040241914s True
      rbd-sub1 rbd-sub1-placement-1-drpc 70m sraghave-c1-jan sraghave-c2-jan Relocate Paused True
      rbd-sub2 rbd-sub2-placement-1-drpc 62s sraghave-c2-jan Deployed Completed 2024-01-16T16:36:06Z 1.04207343s True
      8. Apply the fix for the BZ
      9. Now fence C1 and fail over the apps from C1 to C2
      10. The failover should succeed, with apps in the Cleaning Up phase as expected since C1 and the ceph nodes are still down
      11. Recover C1 and the ceph nodes, then unfence and reboot
      12. NOTE: DRPCs are still in the Cleaning Up phase after more than 12 hrs
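The final observation (cleanup still pending after more than 12 hrs) comes from periodically re-checking the DRPC progression. The polling logic can be sketched as follows; `get_progressions` is a hypothetical stub standing in for `oc get drpc --all-namespaces -o wide` so the loop is self-contained, and it pretends cleanup finishes on the third poll so the loop terminates (in the failure reported here, it never would):

```shell
# Sketch of a wait loop for DRPC cleanup. get_progressions is a hypothetical
# stub standing in for `oc get drpc --all-namespaces -o wide`; it reports
# cleanup as finished on the third poll so this sketch terminates.
i=0
get_progressions() {
    if [ "$i" -ge 3 ]; then echo "Completed"; else echo "Cleaning Up"; fi
}
while get_progressions | grep -q 'Cleaning Up'; do
    i=$((i + 1))
    # on a live cluster: sleep 60
done
echo "polls before cleanup completed: $i"
# prints: polls before cleanup completed: 3
```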

            Contacts: ZHAO XUE (zxue@redhat.com), Rakesh GM (rhn-support-rgowdege), Hui Chen
