
[2189547] [IBM Z] [MDR]: Failover of application stuck in "Failing over" state


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.12.z
    • odf-4.12
    • 4.12.5

      Description of problem (please be as detailed as possible and provide log snippets):

      managed cluster 1 - mc1
      managed cluster 2 - mc2

      Application failover from mc1 to mc2 is stuck in the "FailingOver" state because restoring the PVs to mc2 failed due to a NooBaa S3 communication failure.
      Only the application's namespace was created on mc2 during the failover operation.

      Before initiating the failover operation, the NooBaa status was Ready on both mc1 and mc2. The must-gather logs of mc1 and mc2 taken before the failover operation are attached to this bug.
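
      For reference, a minimal sketch of how the NooBaa/S3 state can be spot-checked on each managed cluster; the resource names below are the ODF defaults and the "s3" route name is an assumption for this environment:

      # Sketch: verify the MCG/NooBaa S3 path on a managed cluster (default names assumed)
      oc -n openshift-storage get noobaa noobaa -o jsonpath='{.status.phase}{"\n"}'   # expect Ready
      oc -n openshift-storage get backingstore,bucketclass                            # backing stores should be Ready
      oc -n openshift-storage get route s3 -o jsonpath='{.spec.host}{"\n"}'           # S3 endpoint exposed by NooBaa (route name assumed)
      oc get obc -A                                                                   # ObjectBucketClaims across namespaces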

      Hub:

      [root@a3e25001 ~]# oc get drpc -n busybox-sample
      NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
      busybox-placement-1-drpc   20h   ocsm4205001        ocpm4202001       Failover       FailingOver
      [root@a3e25001 ~]#

      1. oc get drpc busybox-placement-1-drpc -n busybox-sample -o yaml
        ...
        status:
          actionStartTime: "2023-04-24T18:08:50Z"
          conditions:
          - lastTransitionTime: "2023-04-24T18:08:50Z"
            message: Started failover to cluster "ocpm4202001"
            observedGeneration: 3
            reason: NotStarted
            status: "False"
            type: PeerReady
          - lastTransitionTime: "2023-04-24T18:08:50Z"
            message: Waiting for PV restore to complete...)
            observedGeneration: 3
            reason: FailingOver
            status: "False"
            type: Available
          lastUpdateTime: "2023-04-25T14:34:01Z"
          phase: FailingOver
          preferredDecision:
            clusterName: ocsm4205001
            clusterNamespace: ocsm4205001
          progression: WaitingForPVRestore
          resourceConditions:
            conditions:
            - lastTransitionTime: "2023-04-24T17:58:02Z"
              message: PVCs in the VolumeReplicationGroup are ready for use
              observedGeneration: 1
              reason: Ready
              status: "True"
              type: DataReady
            - lastTransitionTime: "2023-04-24T17:58:02Z"
              message: VolumeReplicationGroup is replicating
              observedGeneration: 1
              reason: Replicating
              status: "False"
              type: DataProtected
            - lastTransitionTime: "2023-04-24T17:58:01Z"
              message: Restored PV cluster data
              observedGeneration: 1
              reason: Restored
              status: "True"
              type: ClusterDataReady
            - lastTransitionTime: "2023-04-25T14:02:42Z"
              message: VRG Kube object protect error
              observedGeneration: 1
              reason: UploadError
              status: "False"
              type: ClusterDataProtected
            resourceMeta:
              generation: 1
              kind: VolumeReplicationGroup
              name: busybox-placement-1-drpc
              namespace: busybox-sample
              protectedpvcs:
              - busybox-pvc
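
      The progression WaitingForPVRestore together with the ClusterDataProtected/UploadError condition points at the S3 upload/restore path. A minimal sketch of follow-up checks on the failover cluster; the namespace and deployment names are assumptions based on ODF 4.12 DR defaults:

      # Sketch: inspect the VRG and the DR cluster operator on mc2 (names assumed)
      oc -n busybox-sample get volumereplicationgroup -o yaml                          # VRG created for the failover, if any
      oc -n openshift-dr-system logs deploy/ramen-dr-cluster-operator | grep -i -e s3 -e restore
      oc -n busybox-sample get events --sort-by=.metadata.creationTimestamp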

      MC2:

      [root@m4202001 ~]# oc get ns busybox-sample
      NAME STATUS AGE
      busybox-sample Active 20h

      [root@m4202001 ~]# oc get all,pvc -n busybox-sample
      No resources found in busybox-sample namespace.
      [root@m4202001 ~]#

      [root@m4202001 ~]# oc get po -n openshift-storage
      NAME READY STATUS RESTARTS AGE
      csi-addons-controller-manager-6bb96f77b6-fcb22 2/2 Running 0 22h
      csi-cephfsplugin-8h6td 2/2 Running 2 25h
      csi-cephfsplugin-9nwpf 2/2 Running 2 25h
      csi-cephfsplugin-provisioner-6c7d889599-25knr 5/5 Running 0 22h
      csi-cephfsplugin-provisioner-6c7d889599-cn6kg 5/5 Running 0 22h
      csi-cephfsplugin-sbx2r 2/2 Running 2 25h
      csi-rbdplugin-484rx 3/3 Running 3 25h
      csi-rbdplugin-5qpsx 3/3 Running 3 25h
      csi-rbdplugin-k7qkv 3/3 Running 3 25h
      csi-rbdplugin-provisioner-d46b79bbb-868p8 6/6 Running 0 22h
      csi-rbdplugin-provisioner-d46b79bbb-frgq8 6/6 Running 0 22h
      noobaa-core-0 1/1 Running 0 22h
      noobaa-db-pg-0 1/1 Running 0 22h
      noobaa-endpoint-5bdc586b7d-v97bf 1/1 Running 0 22h
      noobaa-operator-66fb78dd94-m7lbh 1/1 Running 0 22h
      ocs-metrics-exporter-6b96597864-sbrtd 1/1 Running 0 22h
      ocs-operator-5598965945-pkmgw 1/1 Running 0 22h
      odf-console-55f8c5f6dd-7fhxc 1/1 Running 0 22h
      odf-operator-controller-manager-5cbb545ddc-h72wf 2/2 Running 0 22h
      rook-ceph-operator-64bb84d64f-z5fs9 1/1 Running 0 22h
      token-exchange-agent-7fd47f9bd8-m6465 1/1 Running 0 21h

      [root@m4202001 ~]# oc get noobaa -n openshift-storage noobaa -o yaml

      ....
        conditions:
        - ...
          status: "False"
          type: Available
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T18:05:12Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:57:13Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "False"
          type: Degraded
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T18:05:12Z"
          message: 'could not open file "base/16385/2601": Read-only file system'
          reason: TemporaryError
          status: "False"
          type: Upgradeable
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:57:13Z"
          status: k8s
          type: KMS-Type
        - lastHeartbeatTime: "2023-04-24T12:57:13Z"
          lastTransitionTime: "2023-04-24T12:58:15Z"
          status: Sync
          type: KMS-Status
        endpoints:
          readyCount: 1
          virtualHosts:
          - s3.openshift-storage.svc
        observedGeneration: 2
        phase: Configuring
        readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck
          out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl
          -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa
          -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou
          can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait
          noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:
          \ master-20220913\n\tNooBaa Operator Version: 5.12.0\n"
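
      The 'Read-only file system' error from the NooBaa DB suggests the Postgres volume on mc2 was remounted read-only, which would be consistent with the S3 communication failure. A minimal sketch of checks on mc2; pod and PVC names are taken from this cluster and the mount check is an assumption:

      # Sketch: check the NooBaa DB pod and its volume on mc2 (names from this report)
      oc -n openshift-storage logs noobaa-db-pg-0 --tail=50
      oc -n openshift-storage describe pvc db-noobaa-db-pg-0
      oc -n openshift-storage get events --field-selector involvedObject.name=noobaa-db-pg-0
      oc -n openshift-storage rsh noobaa-db-pg-0 mount | grep ' ro,'                   # look for a volume remounted read-only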

      RHCS:

      [root@rhcs01 ~]# ceph -s
        cluster:
          id:     778d5284-ddf7-11ed-a790-525400c41d12
          health: HEALTH_OK

        services:
          mon: 5 daemons, quorum rhcs01,rhcs02,rhcs04,rhcs05,rhcs07 (age 5d)
          mgr: rhcs01.ipckaw(active, since 6d), standbys: rhcs04.kfpmco
          mds: 1/1 daemons up, 1 standby
          osd: 6 osds: 6 up (since 6d), 6 in (since 6d)
          rgw: 2 daemons active (2 hosts, 1 zones)

        data:
          volumes: 1/1 healthy
          pools:   10 pools, 289 pgs
          objects: 1.18k objects, 1.8 GiB
          usage:   9.5 GiB used, 2.9 TiB / 2.9 TiB avail
          pgs:     289 active+clean
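
      Assuming the MCG default backing store in this external-mode setup sits on the RHCS object gateway (an assumption for this environment), a quick cross-check on the RHCS side could look like this sketch:

      # Sketch: confirm the RGW service backing the object path is healthy (RHCS 5 cephadm CLI)
      ceph health detail
      ceph orch ls rgw                   # both RGW daemons should be running
      radosgw-admin bucket list | head   # RGW answers admin queries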

      Version of all relevant components (if applicable):
      OCP: 4.12.11
      odf-operator.v4.12.2-rhodf
      RHCS: 5.3.z2

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Yes

      Is there any workaround available to the best of your knowledge?
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Configure a Metro-DR environment with mc1, mc2, and a hub cluster
      2. Deploy the sample application busybox
      3. Apply fencing to mc1 from the hub cluster and verify that the fencing is successful
      4. Initiate failover of the application from mc1 to mc2 (see the CLI sketch after these steps)
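
      One way to drive steps 3 and 4 from the hub via the CLI is sketched below; cluster and resource names are taken from the outputs above, and the DRCluster/DRPlacementControl field names are assumptions based on the Ramen Metro-DR API:

      # Sketch: fence mc1 (ocsm4205001) and fail the app over to mc2 (ocpm4202001)
      oc patch drcluster ocsm4205001 --type merge -p '{"spec":{"clusterFence":"Fenced"}}'
      oc get drcluster ocsm4205001 -o jsonpath='{.status.phase}{"\n"}'      # expect Fenced
      oc -n busybox-sample patch drpc busybox-placement-1-drpc --type merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"ocpm4202001"}}'
      oc -n busybox-sample get drpc busybox-placement-1-drpc -o wide        # watch CURRENTSTATE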

      Actual results:
      Failover is stuck in the "FailingOver" state

      Expected results:
      Application failover should be successful

      Additional info:

      Must-gather logs of mc1 and mc2 before the failover operation:

      https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link

      https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

      Must-gather logs of mc1, mc2, and the hub after the failover operation was initiated:

      https://drive.google.com/file/d/1WdKv-rTOO0cAtdotdz4G_yPEXC1RBmS7/view?usp=share_link

      https://drive.google.com/file/d/1tFZ2pvuJ9D_2yYC5EstpNYqQpuP0tvys/view?usp=share_link

      https://drive.google.com/file/d/1e4J2J_UzgcBvpWguEIZgMzlZDR9jsKKE/view?usp=share_link
