Data Foundation Bugs / DFBUGS-527

[2267731] [RDR] RBD apps fail to Relocate when using stale Ceph pool IDs from replacement cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Version: odf-4.15
    • Component: rook
    • Doc Text:
      RBD applications fail to Relocate when using stale Ceph pool IDs from a replacement cluster
      For applications created before the new peer cluster is created, the RBD PVC cannot be mounted, because when a peer cluster is replaced it is not possible to update the CephBlockPoolID mapping in the CSI configmap.

      Workaround: Update the `rook-ceph-csi-mapping-config` configmap with the cephBlockPoolID mapping on the peer cluster that is not replaced; this enables mounting the RBD PVC for the application (a sketch of the edit follows below).
    • Doc Type: Known Issue
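      A minimal sketch of the workaround's configmap edit (placeholders only; the actual key/value pair depends on the block pool IDs of the two clusters, which are laid out in the diagnosis further below):

      $ oc edit cm -n openshift-storage rook-ceph-csi-mapping-config
      # Add or adjust the RBDPoolIDMapping entry so the peer cluster's block pool ID
      # resolves to the local block pool ID:
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"<peer pool ID>":"<local pool ID>"}]}]'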

      Description of problem (please be as detailed as possible and provide log snippets):

      Created an RDR environment with a hub cluster (perf1) and 2 managed clusters, perf2 and perf3. Then tested the replacement cluster steps using KCS https://access.redhat.com/articles/7049245 and added a new recovery cluster, perf-2.

      The last step, Relocating back to the primary cluster, failed and shows the RBD app pods stuck in a creating state because their PVC/PV are in a bad state.

      This is because when "perf-2" was added as the new recovery cluster, its Ceph pool IDs differ from those of the original cluster "perf2" that it replaced.

      perf3, the cluster the RBD apps were Relocated from:

      $ ceph df | grep -B 3 -A 1 cephblockpool

      --- POOLS ---
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      ocs-storagecluster-cephblockpool 1 32 837 MiB 407 2.3 GiB 0.94 83 GiB
      .mgr 2 1 705 KiB 2 2.1 MiB 0 83 GiB

      New perf-2, the cluster the RBD apps were Relocated to:

      $ ceph df | grep -B 2 cephblockpool
      POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
      .mgr 1 1 577 KiB 2 1.7 MiB 0 82 GiB
      ocs-storagecluster-cephblockpool 2 32 817 MiB 378 2.3 GiB 0.91 82 GiB
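      For reference, the pool IDs above can be read directly from each managed cluster; a minimal sketch, assuming the rook-ceph toolbox pod is deployed in openshift-storage (the label and namespace here are the usual ODF defaults, not taken from this bug's logs):

      $ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
      # Print the block pool's name and numeric ID on this cluster;
      # repeat with the other cluster's kubeconfig and compare the IDs.
      $ oc -n openshift-storage rsh $TOOLS ceph osd pool ls detail | \
        awk '$1 == "pool" && $3 ~ /cephblockpool/ {print $3, "id", $2}'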

      Version of all relevant components (if applicable):
      OCP 4.14.11
      ODF 4.15 (build 146)
      ACM 2.9.2

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      Yes, RBD apps are in a failed state.

      Is there any workaround available to the best of your knowledge?
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      5

      Is this issue reproducible?
      It is intermittent, because Ceph pool IDs do not always change when a new recovery cluster is created with ODF installed.

      Steps to Reproduce:
      0) Create an RDR environment with a hub cluster and 2 managed clusters named perf2 and perf3 in the ACM cluster view.
      1) Fail the original perf2 cluster (power down all nodes)
      2) Failover perf2 rbd and cephfs apps to perf3
      3) Validate the apps failed over correctly and are working as expected given perf2 is down (replication between clusters is down)
      4) Delete DRCluster perf2 using hub cluster
      5) Validate the s3Profile for perf2 is removed from all VRGs on perf3 (see the check sketched after this list)
      6) Disable DR for all rbd and cephfs apps from perf2
      7) Remove all DR config from perf3 and hub cluster
      8) Remove submariner using ACM UI
      9) Detach perf2 cluster using ACM UI
      10) Create a new cluster and add it using the ACM UI as perf-2
      11) Install ODF 4.15 build 146 on perf-2
      12) Add submariner add-ons using ACM UI
      13) Install MCO (ODF 4.15 build 146) using hub cluster
      14) Create first DRPolicy
      15) Apply DR policy to rbd and cephfs apps originally on perf2
      16) Relocate rbd and cephfs apps back to perf-2
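      A possible check for step 5, sketched under the assumption that the Ramen VRG (VolumeReplicationGroup, API group ramendr.openshift.io) lists its S3 store profiles under spec.s3Profiles; verify the resource and field names against the VRGs in your build:

      # On perf3: print each VRG and the S3 profiles it still references;
      # the profile that pointed at the deleted perf2 cluster should be gone.
      $ oc get volumereplicationgroups.ramendr.openshift.io -A \
        -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{": "}{.spec.s3Profiles}{"\n"}{end}'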

      Actual results:
      RBD apps failed because of bad PVC/PV state.

      Expected results:
      RBD apps are created with healthy PVC/PV state.

      Additional info:

      Shyam's diagnosis:

      The issue is in the Pool ID mapping config map for Ceph-CSI (as follows):

      perf-2 (c1)
      ===========

      Pool ID for the RBD pool is 2: pool 2 'ocs-storagecluster-cephblockpool' (from ceph osd pool ls detail)

      CSI mapping ConfigMap has this:

      $ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"2"}]}]'
      kind: ConfigMap

      perf3 (c2)
      ==========

      Pool ID for the RBD pool is 1: pool 1 'ocs-storagecluster-cephblockpool'

      CSI mapping ConfigMap has this:

      $ oc get cm -n openshift-storage rook-ceph-csi-mapping-config -o yaml
      apiVersion: v1
      data:
        csi-mapping-config-json: '[{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"8":"1"}]}]'
      kind: ConfigMap

      The PVC was initially created on the cluster that was lost, and hence carries this CSI volume handle:

      volumeHandle: 0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a

      Note the 0000000000000008 field: that is the Pool ID (8), which is not the pool ID in either of the current clusters.
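      For illustration, the encoded pool ID can be pulled out of such a handle; a small sketch that assumes the layout seen above (a 16-hex-digit pool ID field followed by the 8-4-4-4-12 image UUID):

      $ HANDLE=0001-0011-openshift-storage-0000000000000008-06e1ec21-887c-4734-baf4-8f12a319ae0a
      # The pool ID is the 6th field from the end (the image UUID takes the last five).
      $ POOL_HEX=$(echo "$HANDLE" | awk -F- '{print $(NF-5)}')
      $ printf 'encoded pool id: %d\n' "0x$POOL_HEX"
      encoded pool id: 8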

      When this was failed over to perf3, the existing CSI mapping mapped ID 8 to ID 1 on perf3, which is correct.

      When we added the new cluster perf-2, neither of the CSI mappings works for this handle, as the mapping on the new cluster only covers Pool ID 1; this is why the error messages point to the pool that currently has ID 8 on perf-2: pool 8 'ocs-storagecluster-cephobjectstore.rgw.log'.

      Anyway, this is an interesting issue: we need to map a non-existent Pool ID to one of the existing pool IDs in the current clusters. Ceph-CSI would need to fix this.
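      Until Ceph-CSI handles this, the stale ID can be mapped by hand; a hedged sketch, where the "8":"2" pair is only an assumption for the perf-2 side of this bug (mapping the stale pool ID 8 to perf-2's current block pool ID 2), and whether the extra pair belongs in the same map or a new list entry should be confirmed against Ceph-CSI:

      # Show the current mapping value:
      $ oc -n openshift-storage get cm rook-ceph-csi-mapping-config \
        -o jsonpath='{.data.csi-mapping-config-json}'
      [{"ClusterIDMapping":{"openshift-storage":"openshift-storage"},"RBDPoolIDMapping":[{"1":"2"}]}]
      # Re-apply it with the stale pool ID added alongside the existing pair:
      $ oc -n openshift-storage patch cm rook-ceph-csi-mapping-config --type merge -p \
        '{"data":{"csi-mapping-config-json":"[{\"ClusterIDMapping\":{\"openshift-storage\":\"openshift-storage\"},\"RBDPoolIDMapping\":[{\"1\":\"2\",\"8\":\"2\"}]}]"}}'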

              mrajanna@redhat.com Madhu R
              aclewett Annette Clewett
              Neha Berry