Data Foundation Bugs / DFBUGS-2060

MON failover behavior during storage node replacement


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: odf-4.16
    • Component: Documentation

      For Chapter 2 – OpenShift Data Foundation deployed using local storage devices, a few suggestions could help make the procedure clearer and easier to follow.

      1. Add the following to the Prerequisites section, to be checked before removing a healthy node:

      • Make sure the monitor (mon) quorum is healthy
      • Confirm that the remaining OSDs and placement groups (PGs) are in a healthy state

      This helps ensure the Ceph cluster is in a good state before removing any node; example verification commands are sketched after this list.
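
      One way to perform these checks, assuming the rook-ceph-tools pod is deployed in the openshift-storage namespace (the label and namespace may differ in a given cluster):

        # Overall cluster health and monitor quorum:
        TOOLS=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
        oc rsh -n openshift-storage $TOOLS ceph status
        oc rsh -n openshift-storage $TOOLS ceph quorum_status --format json-pretty

        # OSD and placement group (PG) health; PGs should be active+clean:
        oc rsh -n openshift-storage $TOOLS ceph osd tree
        oc rsh -n openshift-storage $TOOLS ceph pg stat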

      2. Add a note before Step 3 (cordon):

      As part of the documented process, the monitor pod on the node being removed is manually scaled down to 0. Once that happens, the Rook operator starts a 10-minute timer; when it expires, the operator triggers a monitor failover, scheduling a canary pod and then creating a new monitor pod on another node. If no suitable node is available (for example, when only two ODF nodes remain), the new monitor pod stays in Pending until a valid node becomes available.

      This explains why a mon pod may remain in Pending status during node removal; the commands below sketch one way to observe this behavior.
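
      A minimal sketch from the CLI, assuming the openshift-storage namespace and using rook-ceph-mon-a as a placeholder for whichever mon runs on the node being removed:

        # Scale down the mon on the node being removed (placeholder name):
        oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage

        # Watch the mon (and canary) pods; the replacement mon stays Pending
        # if no suitable node is available:
        oc get pods -n openshift-storage -l app=rook-ceph-mon -w

        # Inspect the Rook mon health-check timeout that drives the ~10-minute
        # failover timer (the field may be empty when the default applies):
        oc get cephcluster -n openshift-storage \
          -o jsonpath='{.items[0].spec.healthCheck.daemonHealth.mon}'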

              Kusuma BG (kbg@redhat.com)
              Jorge Claret Membrado (rhn-support-jclaretm)
              Neha Berry