Data Foundation Bugs / DFBUGS-468

[2295782] [RDR][MDR][Tracker ACM-12448] Post hub recovery, subscription app pods are not coming up after Failover from c1 to c2.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Versions: odf-4.17.2, odf-4.16
    • Component: odf-dr/ramen
    • 4.17.1
    • 4.17.0-105
    • Doc Text:

      .Post hub recovery, subscription app pods now come up after Failover

      Previously, post hub recovery, the subscription application pods did not come up after failover from the primary to the secondary managed clusters. This was caused by an RBAC error on the AppSub subscription resource on the managed cluster, due to a timing issue in the backup and restore scenario.

      This issue has been fixed, and subscription app pods now come up after failover from the primary to the secondary managed clusters.
    • Bug Fix

      Description of problem (please be as detailed as possible and provide log
      snippets):
      Observing an issue related to subscription apps post MDR co-situated hub recovery (c1 + active hub + ceph (zone b) were down). Appset-pull and discovered apps were failed over successfully using the new hub.
      But sub app pods are not showing up after failover from c1 to c2, although the PVCs and VRGs for these apps are failed over.

      The DRPC of the sub apps shows they have failed over successfully, but the respective app pods are missing on c2:
      busybox-sub-1 busybox-sub-1-placement-1-drpc 17h pbyregow-cl1 pbyregow-cl2 Failover FailedOver Completed 2024-07-03T16:04:38Z 2h0m45.152881171s True
      vm-pvc-acm-sub1 vm-pvc-acm-sub1-placement-1-drpc 17h pbyregow-cl1 pbyregow-cl2 Failover FailedOver Completed 2024-07-03T16:17:57Z 2h14m58.850396117s True
      vm-pvc-acm-sub2 vm-pvc-acm-sub2-placement-1-drpc 17h pbyregow-cl1 pbyregow-cl2 Failover FailedOver Completed 2024-07-03T16:18:03Z 2h14m52.041023629s True
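
      (For reference, a listing like the one above can be pulled from the new hub; a minimal sketch, assuming the hub kubeconfig is active:)

      # hypothetical check from the hub: wide output adds progression, start time, duration and peer-ready columns
      oc get drpc -A -o wide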

      for i in {busybox-sub-1,vm-pvc-acm-sub1,vm-pvc-acm-sub2}; do oc get pod,pvc,vrg -n $i; done
      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
      persistentvolumeclaim/busybox-cephfs-pvc-1 Bound pvc-cba9f468-46ee-41de-a6a5-0650e9235b8b 100Gi RWO ocs-external-storagecluster-cephfs <unset> 19h
      persistentvolumeclaim/busybox-rbd-pvc-1 Bound pvc-4be77410-ef6b-454f-9835-2b8c111f88c6 100Gi RWO ocs-external-storagecluster-ceph-rbd <unset> 19h

      NAME DESIREDSTATE CURRENTSTATE
      volumereplicationgroup.ramendr.openshift.io/busybox-sub-1-placement-1-drpc primary Primary
      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
      persistentvolumeclaim/vm-1-pvc Bound pvc-96184450-4ed0-4879-84a7-76fd3407af7a 512Mi RWX ocs-external-storagecluster-ceph-rbd <unset> 19h

      NAME DESIREDSTATE CURRENTSTATE
      volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub1-placement-1-drpc primary Primary
      NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
      persistentvolumeclaim/vm-1-pvc Bound pvc-584707a8-81af-4994-9f08-90556b4f26a7 512Mi RWX ocs-external-storagecluster-ceph-rbd <unset> 19h

      NAME DESIREDSTATE CURRENTSTATE
      volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub2-placement-1-drpc primary Primary
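
      (Since the PVCs and VRGs are Primary on c2 while the workload pods are missing, a next check is the propagated AppSub on the managed cluster; a sketch, assuming the c2 kubeconfig and the busybox-sub-1 namespace from above:)

      # hypothetical check on c2: list and describe the propagated AppSub to see its status and conditions
      oc get subscriptions.apps.open-cluster-management.io -n busybox-sub-1
      oc describe subscriptions.apps.open-cluster-management.io -n busybox-sub-1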

      Seeing this error on the subscription in the ACM console for the busybox-sub-1 app:

      {ggithubcom-red-hat-storage-ocs-workloads-ns/ggithubcom-red-hat-storage-ocs-workloads <nil> [] 0xc0025bd470 [] <nil> nil [] [] false}
      { 0001-01-01 00:00:00 +0000 UTC { [] []} map[]}}: channels.apps.open-cluster-management.io
      "ggithubcom-red-hat-storage-ocs-workloads" is forbidden: User
      "system:open-cluster-management:cluster:pbyregow-cl2:addon:application-manager:agent:application-manager"
      cannot get resource "channels" in API group
      "apps.open-cluster-management.io" in the namespace
      "ggithubcom-red-hat-storage-ocs-workloads-ns"

      Version of all relevant components (if applicable):
      OCP: 4.16.0-0.nightly-2024-06-27-091410
      ODF: 4.16.0-134
      ACM: 2.11.0-137
      OADP: 1.4 (latest) on hub and managed clusters

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Configured MDR cluster as per the versions listed.
      2. Deployed sub, appset-pull, and discovered apps, applied policies, and had them in different states (Deployed/FailedOver/Relocated) on both clusters.
      3. Configured backup and waited ~2 hrs for the latest backup. Had the latest backup without any changes in between for any apps.
      4. Brought down c1 + active hub + 3 ceph nodes.
      5. Restored on the new hub; restore completed successfully. Followed the hub recovery doc to apply appliedManifestWorkEvictionGracePeriod: "24h" (see the sketch after this list).
      6. DRPolicy reached Validated state.
      7. Removed appliedManifestWorkEvictionGracePeriod after the DRPolicy and DRPCs recovered.
      8. Failed over apps from c1 to c2.
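
      (As referenced in step 5, the eviction grace period was applied per the hub recovery doc. A minimal sketch of one way to set it, assuming the global KlusterletConfig approach; the exact CR layout and API version should be confirmed against the ACM 2.11 hub recovery doc:)

      # hypothetical: extend the AppliedManifestWork eviction grace period on managed clusters
      oc apply -f - <<EOF
      apiVersion: config.open-cluster-management.io/v1alpha1
      kind: KlusterletConfig
      metadata:
        name: global
      spec:
        appliedManifestWorkEvictionGracePeriod: "24h"
      EOF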

      Actual results:
      Subscription app pods did not come up after failover post hub recovery.

      Expected results:
      Sub app pods should come up along with the rest of the resources.

      Additional info:
      The rest of the apps (appset-pull and discovered) failed over to c2 successfully.

              egershko Elena Gershkovich
              rhn-support-pbyregow Parikshith Byregowda
              Aman Agrawal
              Aman Agrawal Aman Agrawal