DFBUGS-1502 (Data Foundation Bugs)

[4.16.z][IBM Support] Rook doesn't delete the blocking PDBs when one OSD is down, even after the cluster recovers from the OSD failure


    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • odf-4.17, odf-4.16
    • rook
    • None
    • False
    • None
    • False
    • Committed
    • ?
    • x86_64
    • ?
    • ?
    • Critical
    • Proposed
    • None

       

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      Rook doesn't delete the blocking PDBs when one OSD is down, even after the cluster recovers from the OSD failure.

       

      Here's the sequence of events:

      1. An OSD goes down due to a hardware failure.
      2. Rook detects this and treats it as a node drain:
        • Deletes the default OSD PDB with maxUnavailable: 1.
        • Creates blocking PDBs for all other failure domains.
        • Sets the noout flag in the Ceph cluster.
      3. After 30 minutes, Rook unsets the noout flag.
      4. After 10 minutes, the OSD is marked out of the cluster and data rebalancing starts.
      5. Cluster rebalancing completes and all PGs go to active+clean.
      6. However, the blocking PDBs are not deleted, and they block any further MCP rollout.
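
The end state of this sequence (all PGs active+clean while blocking PDBs still exist) can be checked mechanically. The following is a minimal sketch, not Rook's own logic. It assumes the Ceph status has been parsed from `ceph status --format json` (using its `pgmap.num_pgs` and `pgmap.pgs_by_state` fields) and the PDB list from `oc get pdb -n openshift-storage -o json`. It also assumes that blocking PDBs carry `maxUnavailable: 0` and share the `rook-ceph-osd` name prefix. The sample PDB names and PG counts are hypothetical.

```python
from typing import Any


def all_pgs_active_clean(ceph_status: dict[str, Any]) -> bool:
    """Return True when every placement group is reported as active+clean."""
    pgmap = ceph_status.get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    clean = sum(
        state["count"]
        for state in pgmap.get("pgs_by_state", [])
        if state["state_name"] == "active+clean"
    )
    return total > 0 and clean == total


def blocking_pdbs(pdb_list: dict[str, Any], prefix: str = "rook-ceph-osd") -> list[str]:
    """Return names of OSD PDBs that permit no disruptions.

    Assumption: a blocking PDB has an integer maxUnavailable of 0.
    Percentage or string values are not handled in this sketch.
    """
    names = []
    for item in pdb_list.get("items", []):
        name = item["metadata"]["name"]
        if name.startswith(prefix) and item.get("spec", {}).get("maxUnavailable") == 0:
            names.append(name)
    return names


def stale_blocking_pdbs(ceph_status: dict[str, Any], pdb_list: dict[str, Any]) -> list[str]:
    """Blocking PDBs that remain even though the cluster is fully clean."""
    if not all_pgs_active_clean(ceph_status):
        return []  # Blocking PDBs are expected while PGs are still recovering.
    return blocking_pdbs(pdb_list)


# Hypothetical sample data shaped like the parsed JSON output.
sample_status = {
    "pgmap": {
        "num_pgs": 169,
        "pgs_by_state": [{"state_name": "active+clean", "count": 169}],
    }
}
sample_pdbs = {
    "items": [
        {"metadata": {"name": "rook-ceph-osd-zone-b"}, "spec": {"maxUnavailable": 0}},
        {"metadata": {"name": "rook-ceph-osd-zone-c"}, "spec": {"maxUnavailable": 0}},
        {"metadata": {"name": "rook-ceph-mon-pdb"}, "spec": {"maxUnavailable": 1}},
    ]
}

stale = stale_blocking_pdbs(sample_status, sample_pdbs)
```

A non-empty result on a clean cluster corresponds to the state described in step 6.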

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP 4.16

      ODF 4.16

       

      Does this issue impact your ability to continue to work with the product?

      Yes, MCP rollout gets stuck

       

      Is there any workaround available to the best of your knowledge?

      Yes, purge the OSD completely from the cluster

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes, 100%

       

      If this is a regression, please provide more details to justify this:

      Yes

       

      Steps to Reproduce:

      1. Choose an OSD and forcefully delete its PV.

      2. OSD gets marked down and the OSD pod gets stuck in Pending state.

      3. Trigger an MCP rollout. The rollout stays stuck indefinitely.
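
As background on why the rollout hangs: an MCP rollout drains nodes, and a drain evicts pods subject to their PodDisruptionBudgets. The sketch below models how many evictions a PDB with an integer `maxUnavailable` permits, following the usual Kubernetes rule (desired healthy = expected pods minus `maxUnavailable`). It is an illustrative model under that assumption, not code from Rook or Kubernetes, and the pod counts are hypothetical.

```python
def disruptions_allowed(expected_pods: int, healthy_pods: int, max_unavailable: int) -> int:
    """Evictions currently permitted by a PDB with an integer maxUnavailable."""
    desired_healthy = expected_pods - max_unavailable
    # Never negative: a PDB can only refuse evictions, not demand them.
    return max(0, healthy_pods - desired_healthy)


# Default OSD PDB (maxUnavailable: 1) with three healthy OSDs: one eviction allowed.
default_pdb_allowed = disruptions_allowed(expected_pods=3, healthy_pods=3, max_unavailable=1)

# Blocking PDB (maxUnavailable: 0) over two healthy OSDs: no eviction allowed,
# so a drain of any node hosting those OSDs cannot proceed.
blocking_pdb_allowed = disruptions_allowed(expected_pods=2, healthy_pods=2, max_unavailable=0)
```

With `maxUnavailable: 0` the result is always zero regardless of pod health, which is consistent with the rollout staying stuck for as long as the blocking PDBs exist.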

       

      Actual results:

      MCP rollout gets stuck

       

      Expected results:

      MCP rollout should not get stuck

       

      Logs collected and log location:

      https://ibm.ent.box.com/folder/304847881914?tc=collab-folder-invite-treatment-b

       

      Additional info:

       Case: TS018085672

       

              sapillai Santosh Pillai
              rhn-support-assingh Ashish Singh
              Neha Berry Neha Berry