Data Foundation Bugs / DFBUGS-599

[2235311] [DOC] Restoring ceph-monitor quorum procedure: the bad mons cannot be deleted from the monmap because of a permission issue


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.13.13
    • odf-4.13
    • Documentation

      Describe the issue:
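      The documented procedure for restoring ceph-monitor quorum does not work as written.
      The bad mons cannot be removed from the monmap: monmaptool fails with "(13) Permission denied" when it writes the updated monmap to /tmp/monmap.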

      Describe the task you were trying to accomplish:
      Test Procedure:
      1. Stop two worker nodes.
      oviner:auth$ oc get nodes
      NAME STATUS ROLES AGE VERSION
      compute-0 NotReady worker 3d v1.27.4+deb2c60
      compute-1 NotReady worker 3d v1.27.4+deb2c60
      compute-2 Ready worker 3d v1.27.4+deb2c60
      control-plane-0 Ready control-plane,master 3d1h v1.27.4+deb2c60
      control-plane-1 Ready control-plane,master 3d1h v1.27.4+deb2c60
      control-plane-2 Ready control-plane,master 3d1h v1.27.4+deb2c60

      oviner:auth$ oc get pods -l app=rook-ceph-mon
      NAME READY STATUS RESTARTS AGE
      rook-ceph-mon-a-576dc56947-l2cqx 0/2 Pending 0 20h
      rook-ceph-mon-b-569d6c5877-fvxf2 2/2 Terminating 0 21h
      rook-ceph-mon-b-569d6c5877-hclhg 0/2 Pending 0 20h
      rook-ceph-mon-c-6646b847ff-r9m4j 2/2 Running 1 (12h ago) 3d
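
The listing above can be filtered to show only the mons that are not Running, which makes it easier to see which mon is still healthy. This is a minimal sketch, assuming the default column layout of `oc get pods` (NAME, READY, STATUS, ...). The function name `unhealthy_mon_pods` is hypothetical and is not part of the documented procedure.

```shell
# Hypothetical helper: reads `oc get pods -l app=rook-ceph-mon` output on stdin
# and prints "<pod> <status>" for every mon pod whose STATUS is not Running.
unhealthy_mon_pods() {
  # Skip the header row, then compare the third column (STATUS).
  awk '$1 != "NAME" && NF >= 3 && $3 != "Running" { print $1, $3 }'
}

# Intended use against a live cluster (not executed here):
#   oc -n openshift-storage get pods -l app=rook-ceph-mon | unhealthy_mon_pods
```

Applied to the output in this report, the filter would leave the mon-a and mon-b pods and drop mon-c, the one pod that is still Running.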

      2. Stop the rook-ceph-operator so that the mons are not failed over while you modify the monmap.
      $ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
      deployment.apps/rook-ceph-operator scaled

      3. Save the rook-ceph-mon-c Deployment to a YAML file, open the file, and copy the command and arguments from the mon container.
      $ oc -n openshift-storage get deployment rook-ceph-mon-c -o yaml > rook-ceph-mon-c-deployment.yaml

      4. Clean up the copied command and args fields to form a pastable command as follows:
      ceph-mon \
      --fsid=8b24e1e2-00f9-4d81-a721-4ee4095fba99 \
      --keyring=/etc/ceph/keyring-store/keyring \
      --default-log-to-stderr=true \
      --default-err-to-stderr=true \
      --default-mon-cluster-log-to-stderr=true \
      --default-log-stderr-prefix=debug \
      --default-log-to-file=false \
      --default-mon-cluster-log-to-file=false \
      --mon-host=$(ROOK_CEPH_MON_HOST) \
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) \
      --id=c \
      --setuser=ceph \
      --setgroup=ceph \
      --foreground \
      --public-addr=172.30.53.157 \
      --setuser-match-path=/var/lib/ceph/mon/ceph-c/store.db \
      --public-bind-addr=$(ROOK_POD_IP) \
      --extract-monmap=${monmap_path}
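
One detail worth checking in the command above: in a Deployment spec, `$(NAME)` is the Kubernetes syntax for referencing a container environment variable, but in an interactive bash shell the same text is command substitution and would try to run a program called `NAME`. The sketch below rewrites such references into ordinary shell variable expansions. It assumes the variables (`ROOK_CEPH_MON_HOST`, `ROOK_CEPH_MON_INITIAL_MEMBERS`, `ROOK_POD_IP`) are set in the mon container's environment. The function name `k8s_refs_to_shell_vars` is hypothetical, and the report does not say whether this affected the reported failure.

```shell
# Hypothetical helper: rewrite Kubernetes-style $(VAR) references on stdin
# into shell-style ${VAR} expansions so the text is safe to paste into bash.
k8s_refs_to_shell_vars() {
  sed -E 's/\$\(([A-Za-z_][A-Za-z0-9_]*)\)/${\1}/g'
}

# Example (not executed here): clean a file holding the copied arguments.
#   k8s_refs_to_shell_vars < copied-args.txt
```

Arguments without a `$(...)` reference, such as `--id=c`, pass through unchanged.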

      5. Patch the rook-ceph-mon-c Deployment so that this mon stops running without the mon pod being deleted.
      $ oc -n openshift-storage patch deployment rook-ceph-mon-c --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
      $ oc -n openshift-storage patch deployment rook-ceph-mon-c -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'
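
Before connecting to the pod, it may help to confirm that the patch took effect. The sketch below reads back the command of the first container in the mon Deployment, assuming (as the livenessProbe patch path does) that the mon container is at index 0. The function name `mon_container_command` is hypothetical.

```shell
# Hypothetical helper: print the command configured for the first container
# of the given mon Deployment, for example `mon_container_command c`.
mon_container_command() {
  oc -n openshift-storage get deployment "rook-ceph-mon-$1" \
    -o jsonpath='{.spec.template.spec.containers[0].command}'
}
```

After the patch, the result for mon `c` should mention `sleep` and `infinity` rather than `ceph-mon`.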

      6. Connect to the pod of a healthy mon [mon-c]:
      $ oc -n openshift-storage exec -it rook-ceph-mon-c-765cbb446f-4xgzw bash
      [root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmap_path=/tmp/monmap

      7. Review the contents of the monmap.
      [root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool --print /tmp/monmap
      monmaptool: monmap file /tmp/monmap
      epoch 3
      fsid 8b24e1e2-00f9-4d81-a721-4ee4095fba99
      last_changed 2023-08-21T10:15:51.349720+0000
      created 2023-08-21T10:13:54.902037+0000
      min_mon_release 17 (quincy)
      election_strategy: 1
      0: v2:172.30.122.31:3300/0 mon.a
      1: v2:172.30.85.192:3300/0 mon.b
      2: v2:172.30.53.157:3300/0 mon.c
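
The list of mons to remove can be derived from this output instead of being read off by hand. This is a minimal sketch, assuming the `monmaptool --print` layout shown above, where each mon line starts with a rank such as `0:` and ends with `mon.<id>`. The function name `mons_to_remove` is hypothetical.

```shell
# Hypothetical helper: reads `monmaptool --print` output on stdin and prints
# the id of every mon except the healthy one passed as the first argument.
mons_to_remove() {
  awk -v keep="$1" '
    $1 ~ /^[0-9]+:$/ && $NF ~ /^mon\./ {
      id = $NF
      sub(/^mon\./, "", id)      # "mon.a" -> "a"
      if (id != keep) print id
    }'
}

# Intended use inside the mon pod (not executed here):
#   monmaptool --print "${monmap_path}" | mons_to_remove c
```

For the monmap in this report, with `c` as the healthy mon, the ids to remove would be `a` and `b`.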

      8. Remove the bad mons from the monmap [Failed]
      [root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool ${monmap_path} --rm a
      monmaptool: monmap file /tmp/monmap
      monmaptool: removing a
      monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
      bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
      monmaptool: error writing to '/tmp/monmap': (13) Permission denied
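
The report does not identify why the write is refused. A first diagnostic step could be to compare the identity of the shell with the ownership and mode of the monmap file. The sketch below does only that. The function name `check_monmap_access` is hypothetical, and the check is a starting point for investigation, not a fix.

```shell
# Hypothetical diagnostic: show who the shell runs as, who owns the monmap
# file, and whether the current user is allowed to rewrite it.
check_monmap_access() {
  path="$1"
  echo "running as: $(id)"
  if [ ! -e "$path" ]; then
    echo "missing: $path"
    return 1
  fi
  ls -l "$path"                  # owner, group, and mode of the file
  if [ -w "$path" ]; then
    echo "writable: $path"
  else
    echo "not writable: $path"
    return 1
  fi
}

# Intended use inside the mon pod (not executed here):
#   check_monmap_access "${monmap_path}"
```

If the file turns out not to be writable, the `ls -l` line shows which user created it, which may help narrow down where the procedure needs to change.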

      Suggestions for improvement:
      The documentation needs a corrected procedure for restoring ceph-monitor quorum, one in which the bad mons can actually be removed from the monmap.

      Document URL:
      https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html/troubleshooting_openshift_data_foundation/restoring-ceph-monitor-quorum-in-openshift-data-foundation_rhodf#doc-wrapper

      Chapter/Section Number and Title:
      Chapter 12. Restoring ceph-monitor quorum in OpenShift Data Foundation

      Product Version:
      ODF Version: odf-operator.v4.14.0-111.stable
      OCP Version: 4.14.0-0.nightly-2023-08-11-055332
      Platform: vSphere

      Environment Details:

      Any other versions of this document that also need this update:

      Additional information:
      For more information, see:
      https://docs.google.com/document/d/1Xu6L4ibi-0PWD9Y8ezeXRQ-TsHRPtnHH-eRaw0pDRec/edit

              asriram@redhat.com Anjana Sriram
              oviner@redhat.com Oded Viner
              Anjana Sriram
              Neha Berry Neha Berry