Bug
Resolution: Unresolved
Critical
odf-4.13
None
Describe the issue:
The documented procedure for restoring ceph-monitor quorum is not correct.
The bad mons cannot be removed from the monmap because writing the modified monmap fails with a permission error.
Describe the task you were trying to accomplish:
Test Procedure:
1. Stop 2 worker nodes.
oviner:auth$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 NotReady worker 3d v1.27.4+deb2c60
compute-1 NotReady worker 3d v1.27.4+deb2c60
compute-2 Ready worker 3d v1.27.4+deb2c60
control-plane-0 Ready control-plane,master 3d1h v1.27.4+deb2c60
control-plane-1 Ready control-plane,master 3d1h v1.27.4+deb2c60
control-plane-2 Ready control-plane,master 3d1h v1.27.4+deb2c60
oviner:auth$ oc get pods -l app=rook-ceph-mon
NAME READY STATUS RESTARTS AGE
rook-ceph-mon-a-576dc56947-l2cqx 0/2 Pending 0 20h
rook-ceph-mon-b-569d6c5877-fvxf2 2/2 Terminating 0 21h
rook-ceph-mon-b-569d6c5877-hclhg 0/2 Pending 0 20h
rook-ceph-mon-c-6646b847ff-r9m4j 2/2 Running 1 (12h ago) 3d
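To see at a glance which mon is still usable, the pod listing above can be filtered for Running pods. This is only a minimal sketch and not part of the documented procedure: the function name `running_mons` is hypothetical, and it assumes the default column order of `oc get pods` (NAME READY STATUS RESTARTS AGE).

```shell
# Hypothetical helper: print the names of mon pods whose STATUS column is "Running".
# Assumes the default `oc get pods` column order: NAME READY STATUS RESTARTS AGE.
running_mons() {
  awk 'NR > 1 && $3 == "Running" { print $1 }'
}

# Sample input copied from the listing above. On a live cluster one would pipe
# `oc get pods -l app=rook-ceph-mon` into the function instead.
running_mons <<'EOF'
NAME READY STATUS RESTARTS AGE
rook-ceph-mon-a-576dc56947-l2cqx 0/2 Pending 0 20h
rook-ceph-mon-b-569d6c5877-fvxf2 2/2 Terminating 0 21h
rook-ceph-mon-b-569d6c5877-hclhg 0/2 Pending 0 20h
rook-ceph-mon-c-6646b847ff-r9m4j 2/2 Running 1 (12h ago) 3d
EOF
```

With the sample input, only the mon-c pod is printed, which matches the choice of mon-c as the healthy mon in the following steps.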
2. Stop the rook-ceph-operator so that the mons are not failed over while you are modifying the monmap.
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled
3. Save the rook-ceph-mon-c Deployment to a YAML file, then open the file and copy the command and arguments from the mon container.
$ oc -n openshift-storage get deployment rook-ceph-mon-c -o yaml > rook-ceph-mon-c-deployment.yaml
4. Clean up the copied command and args fields to form a pastable command as follows:
ceph-mon \
--fsid=8b24e1e2-00f9-4d81-a721-4ee4095fba99 \
--keyring=/etc/ceph/keyring-store/keyring \
--default-log-to-stderr=true \
--default-err-to-stderr=true \
--default-mon-cluster-log-to-stderr=true \
--default-log-stderr-prefix=debug \
--default-log-to-file=false \
--default-mon-cluster-log-to-file=false \
--mon-host=$(ROOK_CEPH_MON_HOST) \
--mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) \
--id=c \
--setuser=ceph \
--setgroup=ceph \
--foreground \
--public-addr=172.30.53.157 \
--setuser-match-path=/var/lib/ceph/mon/ceph-c/store.db \
--public-bind-addr=$(ROOK_POD_IP) \
--extract-monmap=${monmap_path}
5. Patch the rook-ceph-mon-c Deployment to stop this mon from running without deleting the mon pod.
$ oc -n openshift-storage patch deployment rook-ceph-mon-c --type='json' -p '[
]'
$ oc -n openshift-storage patch deployment rook-ceph-mon-c -p '{"spec": {"template": {"spec": {"containers": [
]}}}}'
6. Connect to the pod of a healthy mon [mon-c]:
$ oc -n openshift-storage exec -it rook-ceph-mon-c-765cbb446f-4xgzw bash
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmap_path=/tmp/monmap
7. Review the contents of the monmap.
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 8b24e1e2-00f9-4d81-a721-4ee4095fba99
last_changed 2023-08-21T10:15:51.349720+0000
created 2023-08-21T10:13:54.902037+0000
min_mon_release 17 (quincy)
election_strategy: 1
0: v2:172.30.122.31:3300/0 mon.a
1: v2:172.30.85.192:3300/0 mon.b
2: v2:172.30.53.157:3300/0 mon.c
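The mons to remove are every entry in the listing above other than the healthy one. As a rough sketch, the IDs could be derived from the printed monmap as follows. The function name `mons_to_remove` is hypothetical, and the parsing assumes member lines of the form `N: address mon.<id>` as shown in this output.

```shell
# Hypothetical helper: read `monmaptool --print` output on stdin and print the ID
# of every mon except the healthy one passed as $1.
# Assumes member lines look like "0: v2:172.30.122.31:3300/0 mon.a".
mons_to_remove() {
  awk -v keep="$1" '$NF ~ /^mon\./ {
    id = substr($NF, 5)          # strip the leading "mon."
    if (id != keep) print id
  }'
}

# Sample input copied from the monmap printed above; mon.c is the healthy mon.
mons_to_remove c <<'EOF'
epoch 3
fsid 8b24e1e2-00f9-4d81-a721-4ee4095fba99
min_mon_release 17 (quincy)
election_strategy: 1
0: v2:172.30.122.31:3300/0 mon.a
1: v2:172.30.85.192:3300/0 mon.b
2: v2:172.30.53.157:3300/0 mon.c
EOF
```

For the sample input this prints `a` and `b`, one per line, which are the two mons the next step attempts to remove.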
8. Remove the bad mons from the monmap [Failed]
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool ${monmap_path} --rm a
monmaptool: monmap file /tmp/monmap
monmaptool: removing a
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied
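The error above shows that the process running `monmaptool` was not allowed to open /tmp/monmap for writing. The following sketch is a diagnostic aid only, not a fix and not part of the documented procedure; the function name `check_monmap_writable` is hypothetical. It reports which user is running the shell and whether the target path (or its parent directory, if the file does not exist) is writable by that user.

```shell
# Hypothetical diagnostic: report the current user and whether the monmap path is writable.
check_monmap_writable() {
  path="$1"
  echo "running as: $(id -un 2>/dev/null || id -u)"
  if [ -e "$path" ]; then
    ls -ld "$path"                      # shows owner, group and mode of the file
    target="$path"
  else
    target="$(dirname "$path")"         # a missing file needs a writable directory
    ls -ld "$target"
  fi
  if [ -w "$target" ]; then
    echo "writable: $target"
  else
    echo "NOT writable: $target"
    return 1
  fi
}

# Example run against a scratch file; inside the mon pod one would pass /tmp/monmap.
scratch="$(mktemp)"
check_monmap_writable "$scratch"
rm -f "$scratch"
```

Comparing the owner shown by `ls -ld` with the reported user may help narrow down the cause. A passing check does not rule out other causes, such as kernel or container security restrictions.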
Suggestions for improvement:
We need to find the correct procedure for restoring ceph-monitor quorum.
Chapter/Section Number and Title:
Chapter 12. Restoring ceph-monitor quorum in OpenShift Data Foundation
Product Version:
ODF Version: odf-operator.v4.14.0-111.stable
OCP Version: 4.14.0-0.nightly-2023-08-11-055332
Platform: vSphere
Environment Details:
Any other versions of this document that also need this update:
Additional information:
For more info:
https://docs.google.com/document/d/1Xu6L4ibi-0PWD9Y8ezeXRQ-TsHRPtnHH-eRaw0pDRec/edit