Data Foundation Bugs / DFBUGS-527

[2267731] [RDR] RBD apps fail to Relocate when using stale Ceph pool IDs from replacement cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Version: odf-4.15
    • Component: rook
    • Doc Text:
      RBD applications fail to Relocate when using stale Ceph pool IDs from a replacement cluster
      For applications created before the new peer cluster is created, the RBD PVC cannot be mounted, because when a peer cluster is replaced it is not possible to update the CephBlockPoolID mapping in the CSI configmap.

      Workaround: Update the `rook-ceph-csi-mapping-config` configmap with the cephBlockPoolID mapping on the peer cluster that is not replaced; this enables mounting the RBD PVC for the application (a sketch of the edit follows below).
    • Doc Type: Known Issue
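      A minimal sketch of the workaround's configmap edit (placeholders only; the actual key/value pair depends on the block pool IDs of the two clusters, which are laid out in the diagnosis further below):

      $ oc edit cm -n openshift-storage rook-ceph-csi-mapping-config
      # Add or adjust the RBDPoolIDMapping entry so the peer cluster's block pool ID
      # resolves to the local block pool ID:
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"<peer pool ID>":"<local pool ID>"}]}]'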

      Description of problem (please be as detailed as possible and provide log snippets):

      Created an RDR environment with a hub cluster (perf1) and 2 managed clusters, perf2 and perf3. Then tested the replacement cluster steps using KCS https://access.redhat.com/articles/7049245 and added a new recovery cluster, perf-2.

      The last step, Relocating back to the primary cluster, failed and shows the RBD app pods stuck in a creating state because their PVC/PV are in a bad state.

      This is because when "perf-2" was added as the new recovery cluster, its Ceph pool IDs differ from those of the original cluster "perf2" that it replaced.

      perf3, the cluster the RBD apps were Relocated from:

      $ ceph df | grep -B 3 -A 1 cephblockpool

      --- POOLS ---
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 32 837 MiB 407 2.3 GiB 0.94 83 GiB
      .mgr 2 1 705 KiB 2 2.1 MiB 0 83 GiB

      New perf-2, the cluster the RBD apps were Relocated to:

      $ ceph df | grep -B 2 cephblockpool
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      .mgr 1 1 577 KiB 2 1.7 MiB 0 82 GiB
      ocs-storagecluster-cephblockpool 2 32 817 MiB 378 2.3 GiB 0.91 82 GiB
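      For reference, the pool IDs above can be read directly from each managed cluster; a minimal sketch, assuming the rook-ceph toolbox pod is deployed in openshift-storage (the label and namespace here are the usual ODF defaults, not taken from this bug's logs):

      $ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
      # Print the block pool's name and numeric ID on this cluster;
      # repeat with the other cluster's kubeconfig and compare the IDs.
      $ oc -n openshift-storage rsh $TOOLS ceph osd pool ls detail | \
        awk '$1 == "pool" && $3 ~ /cephblockpool/ {print $3, "id", $2}'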

      Version of all relevant components (if applicable):
      OCP 4.14.11
      ODF 4.15 (build 146)
      ACM 2.9.2

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      Yes, RBD apps are in a failed state.

      Is there any workaround available to the best of your knowledge?
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      5

      Is this issue reproducible?
      It is intermittent, because Ceph pool IDs do not always change when a new recovery cluster is created with ODF installed.

      Steps to Reproduce:
      0) Create an RDR environment with a hub cluster and 2 managed clusters named perf2 and perf3 in the ACM cluster view.
      1) Fail the original perf2 cluster (power down all nodes)
      2) Failover perf2 rbd and cephfs apps to perf3
      3) Validate the apps failed over correctly and are working as expected given perf2 is down (replication between clusters is down)
      4) Delete DRCluster perf2 using hub cluster
      5) Validate the s3Profile for perf2 is removed from all VRGs on perf3 (see the check sketched after this list)
      6) Disable DR for all rbd and cephfs apps from perf2
      7) Remove all DR config from perf3 and hub cluster
      8) Remove submariner using ACM UI
      9) Detach perf2 cluster using ACM UI
      10) Create a new cluster and add it using the ACM UI as perf-2
      11) Install ODF 4.15 build 146 on perf-2
      12) Add submariner add-ons using ACM UI
      13) Install MCO (ODF 4.15 build 146) using hub cluster
      14) Create first DRPolicy
      15) Apply DR policy to rbd and cephfs apps originally on perf2
      16) Relocate rbd and cephfs apps back to perf-2
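      A possible check for step 5, sketched under the assumption that the Ramen VRG (VolumeReplicationGroup, API group ramendr.openshift.io) lists its S3 store profiles under spec.s3Profiles; verify the resource and field names against the VRGs in your build:

      # On perf3: print each VRG and the S3 profiles it still references;
      # the profile that pointed at the deleted perf2 cluster should be gone.
      $ oc get volumereplicationgroups.ramendr.openshift.io -A \
        -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{": "}{.spec.s3Profiles}{"\n"}{end}'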

      Actual results:
      RBD apps failed because of bad PVC/PV state.

      Expected results:
      RBD apps are created with healthy PVC/PV state.

      Additional info:

      Shyam's diagnosis:

      The issue is in the Pool ID mapping config map for Ceph-CSI (as follows):

      perf-2 (c1)
      ===========

      Pool ID for the RBD pool is 2: pool 2 'ocs-storagecluster-cephblockpool' (from ceph osd pool ls detail)

      CSI mapping ConfigMap has this:

      $ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"2"}]}]'
      kind: ConfigMap

      perf3 (c2)
      ==========

      Pool ID for the RBD pool is 1: pool 1 'ocs-storagecluster-cephblockpool'

      CSI mapping ConfigMap has this:

      $ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"8":"1"}]}]'
      kind: ConfigMap

      The PVC was initially created on the cluster that was lost, and hence carries this CSI volume handle:

      volumeHandle: 0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a

      Note the 0000000000000008 field: that is the Pool ID (8), which is not the pool ID in either of the current clusters.
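      For illustration, the encoded pool ID can be pulled out of such a handle; a small sketch that assumes the layout seen above (a 16-hex-digit pool ID field followed by the 8-4-4-4-12 image UUID):

      $ HANDLE=0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a
      # The pool ID is the 6th field from the end (the image UUID takes the last five).
      $ POOL_HEX=$(echo "$HANDLE" | awk -F- '{print $(NF-5)}')
      $ printf 'encoded pool id: %d\n' "0x$POOL_HEX"
      encoded pool id: 8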

      When this was failed over to perf3, the existing CSI mapping mapped ID 8 to ID 1 on perf3, which is correct.

      When we added the new cluster perf-2, neither of the CSI mappings works for this handle, as the mapping on the new cluster only covers Pool ID 1; this is why the error messages point to the pool that currently has ID 8 on perf-2: pool 8 'ocs-storagecluster-cephobjectstore.rgw.log'.

      Anyway, this is an interesting issue: we need to map a non-existent Pool ID to one of the existing pool IDs in the current clusters. Ceph-CSI would need to fix this.
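      Until Ceph-CSI handles this, the stale ID can be mapped by hand; a hedged sketch, where the "8":"2" pair is only an assumption for the perf-2 side of this bug (mapping the stale pool ID 8 to perf-2's current block pool ID 2), and whether the extra pair belongs in the same map or a new list entry should be confirmed against Ceph-CSI:

      # Show the current mapping value:
      $ oc -n openshift-storage get cm rook-ceph-csi-mapping-config \
        -o jsonpath='{.data.csi-mapping-config-json}'
      [{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"2"}]}]
      # Re-apply it with the stale pool ID added alongside the existing pair:
      $ oc -n openshift-storage patch cm rook-ceph-csi-mapping-config --type merge -p \
        '{"data":{"csi-mapping-config-json":"[{\"ClusterIDMapping\":{\"openshift-storage\":\"openshift-storage\"},\"RBDPoolIDMapping\":[{\"1\":\"2\",\"8\":\"2\"}]}]"}}'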

              mrajanna@redhat.com Madhu R
              aclewett Annette Clewett
              Neha Berry