OpenShift Data Foundation Request For Enhancement · ODFRFE-59

[2216440] [IBM Z /MDR]: Failover of application fails when OpenShift is reinstalled on one of the managed clusters after a disaster


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version: odf-4.20.z

      Description of problem (please be as detailed as possible and provide log
      snippets):

      One of the managed clusters (mc2) went down (due to a disaster/network issue), and OCP was reinstalled on mc2 to bring the cluster back up. SSL access across the clusters was configured with the new ingress cert. On the Hub cluster, the old mc2 was detached and the reinstalled mc2 was imported again. The OpenShift DR Cluster operator was installed on mc2 automatically once the cluster was imported into the Hub cluster. Failover of the application from mc1 to mc2 is stuck in the "FailingOver" state with the following error message:

      oc describe drpc busybox-placement-1-drpc -n busybox-sample
        .....
        ......
        Events:
          Type     Reason                   Age                  From                           Message
          ----     ------                   ----                 ----                           -------
          Warning  unknown state            16m                  controller_DRPlacementControl  next state not known
          Warning  DRPCFailingOver          16m                  controller_DRPlacementControl  Failing over the application and VRG
          Warning  DRPCClusterSwitchFailed  16m                  controller_DRPlacementControl  failed to get VRG busybox-placement-1-drpc from cluster ocpm4202001 (err: getManagedClusterResource results:  "requested resource not found in ManagedCluster" not found)
          Warning  DRPCClusterSwitchFailed  6m48s (x5 over 16m)  controller_DRPlacementControl  Waiting for App resources to be restored...)
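
      The "requested resource not found in ManagedCluster" event indicates that the hub could not read the VRG back from the reinstalled cluster. One hedged way to confirm where the VRG actually exists (the namespace and cluster names are taken from this report; the ManagedClusterView check is an assumption about how Ramen fetches the VRG from managed clusters):

      # On mc1 (current primary), the VRG should exist and report Primary
      oc get vrg -n busybox-sample

      # On the reinstalled mc2, the VRG is expected to be missing in this state
      oc get vrg -n busybox-sample

      # On the hub, inspect the per-cluster view Ramen uses to fetch the VRG
      # (ManagedClusterView objects live in the managed cluster's namespace)
      oc get managedclusterview -n ocpm4202001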

      The VRG log on mc1 reports the following error:

      2023-06-15T16:53:37.670Z ERROR controllers.VolumeReplicationGroup.vrginstance controllers/vrg_vrgobject.go:50 VRG Kube object protect error

      {"VolumeReplicationGroup": "busybox-appset-sample/appset1-busybox-placement-drpc", "rid": "69471d6d-6a0e-450b-b5ef-887595f196b1", "State": "primary", "profile": "s3profile-ocpm4202001-ocs-external-storagecluster", "error": "failed to upload data of odrbucket-373521917843:busybox-appset-sample/appset1-busybox-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InvalidAccessKeyId: The AWS access key Id you provided does not exist in our records.\n\tstatus code: 403, request id: lixdr8jk-dg4r8q-1ddq, host id: lixdr8jk-dg4r8q-1ddq"}

      github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect
      /remote-source/app/controllers/vrg_vrgobject.go:50
      github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
      /remote-source/app/controllers/volumereplicationgroup_controller.go:918
      github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
      /remote-source/app/controllers/volumereplicationgroup_controller.go:889
      github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions
      /remote-source/app/controllers/volumereplicationgroup_controller.go:551
      github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
      /remote-source/app/controllers/volumereplicationgroup_controller.go:524
      github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
      /remote-source/app/controllers/volumereplicationgroup_controller.go:413
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
      /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235
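
      The InvalidAccessKeyId error suggests that mc1's Ramen is still uploading VRG metadata to mc2's object bucket using the pre-reinstall S3 credentials; reinstalling OCP/ODF on mc2 recreates the object store, so the old access key no longer exists there. A minimal sketch for comparing the stale and current credentials, assuming typical MDR resource names (the secret names and namespaces below are assumptions, not taken from this report):

      # On the hub: find the S3 secret referenced by the mc2 s3StoreProfile
      oc get configmap ramen-hub-operator-config -n openshift-operators -o yaml

      # Decode the access key the hub/mc1 side is still using
      oc get secret <mc2-s3-secret> -n openshift-operators \
        -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d

      # On the reinstalled mc2: decode the freshly generated key from the
      # DR ObjectBucketClaim secret (name/namespace are assumptions)
      oc get secret <odrbucket-obc-secret> -n openshift-storage \
        -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d

      # If the two keys differ, the DR S3 secret was never refreshed after the
      # reinstall, which matches the 403 InvalidAccessKeyId in the VRG log.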

      Version of all relevant components (if applicable):
      OCP: 4.13.0
      ODF on hub, mc1, mc2: 4.13.0-218

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Configure an MDR environment with 1 hub cluster and 2 managed clusters (Hub, mc1, and mc2)
      2. Deploy an application and perform failover and relocate operations
      3. Bring down one of the managed clusters (mc2) and reinstall OCP on mc2
      4. Configure SSL access across the clusters with the new ingress cert from mc2 (see the sketch after this list)
      5. On the Hub cluster, detach the old mc2 and import the reinstalled mc2
      6. The OpenShift DR Cluster operator is installed on mc2 automatically once the cluster is imported into the Hub cluster
      7. Perform failover of the application from mc1 to mc2
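
      A minimal sketch of steps 4, 5, and 7, assuming standard OpenShift/ACM resources (the ConfigMap names, trust-bundle handling, and cluster name below are plausible defaults, not confirmed by this report):

      # Step 4: export the reinstalled mc2's default ingress CA, merge it into
      # a combined trust bundle, and apply that bundle on the hub and mc1
      oc get configmap default-ingress-cert -n openshift-config-managed \
        -o jsonpath='{.data.ca-bundle\.crt}' > mc2-ingress-ca.crt
      oc create configmap user-ca-bundle -n openshift-config \
        --from-file=ca-bundle.crt=combined-ca-bundle.crt \
        --dry-run=client -o yaml | oc apply -f -
      oc patch proxy cluster --type=merge \
        -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'

      # Step 5: on the hub, detach the stale cluster entry, then re-import the
      # reinstalled mc2 through the ACM console (or equivalent manifests)
      oc delete managedcluster ocpm4202001

      # Step 7: trigger failover by setting the DRPC action and target cluster
      oc patch drpc busybox-placement-1-drpc -n busybox-sample --type=merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'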

      Actual results:
      Failover of the application is stuck in the "FailingOver" state

      Expected results:
      Failover of the application should succeed

      Additional info:

      The VRG logs of the managed clusters are attached to Bugzilla, and the must-gather logs of all the clusters are uploaded to Google Drive:

      https://drive.google.com/file/d/1NmvKrORqcX-17Bd8YfOoLTHYwGRqbLX8/view?usp=sharing

              Assignee: Nir Soffer (nsoffer@redhat.com)
              Reporter: Sravika Balusu (sravikab2)