- Bug
- Resolution: Unresolved
- Major
- None
- odf-4.15
- None
Description of problem (please be as detailed as possible and provide log snippets):
Created an RDR environment with a hub cluster (perf1) and 2 managed clusters, perf2 and perf3. Then tested the replacement cluster steps using KCS https://access.redhat.com/articles/7049245 and added a new recovery cluster, perf-2.
The last step, Relocating back to the primary cluster, failed and shows RBD app pods stuck in a creating state because their PVCs/PVs are in a bad state.
This is because the new recovery cluster "perf-2" has Ceph pool IDs that differ from those of the original cluster "perf2" it replaced.
perf3, the cluster the RBD apps were Relocated from:
$ ceph df | grep -B 3 -A 1 cephblockpool
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
ocs-storagecluster-cephblockpool 1 32 837 MiB 407 2.3 GiB 0.94 83 GiB
.mgr 2 1 705 KiB 2 2.1 MiB 0 83 GiB
new perf-2, the cluster the RBD apps were Relocated to:
$ ceph df | grep -B 2 cephblockpool
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 82 GiB
ocs-storagecluster-cephblockpool 2 32 817 MiB 378 2.3 GiB 0.91 82 GiB
Version of all relevant components (if applicable):
OCP 4.14.11
ODF 4.15 (build 146)
ACM 2.9.2
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, RBD apps are in a failed state.
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5
Can this issue be reproduced?
It is intermittent because Ceph pool IDs do not always change when a new recovery cluster is created with ODF installed.
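One way to confirm whether the IDs diverged is to compare the RBD pool ID on the new recovery cluster against the surviving cluster before applying the DR policy, using the same command referenced in the diagnosis below (run from the Ceph toolbox on each cluster; the grep pattern assumes the default ODF block pool name used in this report):
$ ceph osd pool ls detail | grep ocs-storagecluster-cephblockpool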
Steps to Reproduce:
0) Create an RDR environment with a hub cluster and 2 managed clusters named perf2 and perf3 in the ACM cluster view.
1) Fail the original perf2 cluster (power down all nodes)
2) Fail over perf2 RBD and CephFS apps to perf3
3) Validate apps failed over correctly and are working as expected given perf2 is down (replication between clusters is down)
4) Delete DRCluster perf2 using the hub cluster
5) Validate the s3Profile for perf2 is removed from all VRGs on perf3 (see the check after this list)
6) Disable DR for all RBD and CephFS apps from perf2
7) Remove all DR config from perf3 and the hub cluster
8) Remove submariner using the ACM UI
9) Detach the perf2 cluster using the ACM UI
10) Create a new cluster and add it using the ACM UI as perf-2
11) Install ODF 4.15 build 146 on perf-2
12) Add submariner add-ons using the ACM UI
13) Install MCO (ODF 4.15 build 146) using the hub cluster
14) Create the first DRPolicy
15) Apply the DR policy to RBD and CephFS apps originally on perf2
16) Relocate RBD and CephFS apps back to perf-2
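For step 5, one possible check on perf3 (a sketch; it assumes the Ramen VolumeReplicationGroup CRD in the ramendr.openshift.io API group and that the perf2 s3 profile name appears in the VRG YAML):
$ oc get volumereplicationgroups.ramendr.openshift.io -A -o yaml | grep -i s3profile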
Actual results:
RBD apps failed because of bad PVC/PV state.
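The bad state can be confirmed on the target cluster with standard checks (the namespace and PVC name below are placeholders):
$ oc get pods,pvc -n <app-namespace>
$ oc describe pvc <pvc-name> -n <app-namespace>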
Expected results:
RBD apps are created with healthy PVC/PV state.
Additional info:
Shyam's diagnosis:
The issue is in the pool ID mapping ConfigMap for Ceph-CSI (as follows):
perf-2 (c1)
===========
Pool ID for the RBD pool is 2: pool 2 'ocs-storagecluster-cephblockpool' (from ceph osd pool ls detail)
CSI mapping ConfigMap has this:
$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":
    ,"RBDPoolIDMapping":[
    {"1":"2"}]}]'
kind: ConfigMap
perf3 (c2)
==========
Pool ID for the RBD pool is 1: pool 1 'ocs-storagecluster-cephblockpool'
CSI mapping ConfigMap has this:
$ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
apiVersion: v1
data:
  csi-mapping-config-json: '[{"ClusterIDMapping":
    ,"RBDPoolIDMapping":[
    {"8":"1"}]}]'
kind: ConfigMap
The PVC was initially created on the cluster that was lost, and hence has this as its CSI volume handle: volumeHandle: 0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a
Note the 0000000000000008 segment: that is the pool ID (8), which is not the RBD pool ID in either of the current clusters.
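For reference, the handle can be read directly from the bound PV (the PV name below is a placeholder):
$ oc get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'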
When this was failed over to perf3, the existing CSI mapping mapped pool ID 8 to pool ID 1 on perf3; this is correct.
When we added the new cluster perf-2, neither of the CSI mappings works, as the new cluster's mapping only covers pool ID 1 (not 8); this is why the error messages also point to whichever pool happens to have ID 8 on perf-2: pool 8 'ocs-storagecluster-cephobjectstore.rgw.log'.
Anyway, this is an interesting issue: we need to map a non-existing pool ID to one of the existing pool IDs in the current clusters. Ceph-CSI would need to fix this.
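For illustration only, not a verified fix: if Ceph-CSI (or an admin) extended the csi-mapping-config-json shown above so that the stale pool ID 8 also maps to perf-2's current RBD pool ID 2, the data might look like the sketch below. The ClusterIDMapping value is a placeholder (it was elided in the dumps above), and it is an assumption that one mapping object can carry multiple pool ID pairs:
csi-mapping-config-json: '[{"ClusterIDMapping":
  {"<perf3-clusterID>":"<perf-2-clusterID>"},
  "RBDPoolIDMapping":[
  {"1":"2","8":"2"}]}]'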