OpenShift Edge Enablement / OCPEDGE-29

LVMS documentation on how to recover from failure

    • LVMO recover from failure
    • Product / Portfolio Work
    • OCPSTRAT-43: LVM storage user experience enhancements
    • 0% To Do, 0% In Progress, 100% Done
    • Backlog Refinement

      Goal

      Be able to recover from a node or disk failure when using LVMO

      Problem

      There is currently no documentation on how to recover from a failure event such as a disk or node failure. This epic is meant to close that gap.

      Why is this important?

      Nodes and disks eventually fail, so we need supported guidelines on how to recover from these events.

      Dependencies 

       

      Prioritized Scenarios

      In Scope

      • Recovery for LVMO installed via ACM or manually via OperatorHub
      • Recovery on node loss (for SNO with additional workers)
        • For any node loss (master or worker)
        • Excluding OCP-specific recovery steps (just focus on the storage bit)
      • Recovery on disk loss
        • OK to throw away the whole LVM VG, data loss is expected
      • In the end the whole cluster should be back in usable state, where new PVs can be created and used on all nodes
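
      The disk-loss scenario above (throwing away the whole LVM VG) can be illustrated with a minimal sketch. It is not the supported LVMS recovery procedure, which this epic still has to define, and it does not cover any operator-level cleanup. It only builds, without running them, the LVM-layer commands an administrator might review. The function name plan_vg_teardown, the VG name vg1, and the device /dev/sdb are hypothetical.

```python
def plan_vg_teardown(vg_name: str, surviving_devices: list[str]) -> list[list[str]]:
    """Build (but do not run) LVM commands that discard a volume group.

    Assumption: all data in the VG is accepted as lost, as stated in the
    scope of this epic. The caller decides whether and where to execute
    the returned commands (for example, in a root shell on the node).
    """
    if not vg_name:
        raise ValueError("vg_name must not be empty")

    plan = [
        # Drop references to physical volumes that no longer exist.
        ["vgreduce", "--removemissing", "--force", vg_name],
        # Remove the volume group together with its logical volumes.
        ["vgremove", "--force", vg_name],
    ]
    # Clear leftover signatures on disks that survived, so they can be reused.
    for device in surviving_devices:
        plan.append(["wipefs", "--all", device])
    return plan


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for command in plan_vg_teardown("vg1", ["/dev/sdb"]):
        print(" ".join(command))
```

      Returning a plan instead of executing it keeps the destructive step under the administrator's control, which fits the open question of how much of the recovery should be automated.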

      Not in Scope

      • Data Recovery

      Documentation Requirements

      • Emphasize that LVMO itself provides no replication, so data loss is expected on disk or node loss

      Customers

       

      Customer Facing Story

      As an administrator, I want to bring my SNO cluster back into usable state after a failure event.

      What does success look like?

      On disk or node loss, we want to recover the cluster to a usable state without reinstalling the whole node (if avoidable)

      Open Questions

      1. How much can we automate? Can we automatically detect a failed VG? Can we add a "button" somewhere that would auto-fix the LVM layer when the administrator detects a disk failure?
      2. Can we use the Health info of disks to help administrators decide when a disk has failed?
        KNIP-1770
        https://kubernetes-csi.github.io/docs/volume-health-monitor.html
      3. KNIP-1770
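
      As one possible angle on the first open question (automatically detecting a failed VG), the sketch below parses the JSON report of the LVM vgs command and flags volume groups that report missing physical volumes. It assumes the report was collected on the node with "vgs --reportformat json -o vg_name,vg_attr,vg_missing_pv_count". The function name find_degraded_vgs and the sample data are hypothetical, and this is not an existing LVMO feature.

```python
import json


def find_degraded_vgs(vgs_json: str) -> list[str]:
    """Return the names of volume groups that appear to have lost a disk.

    A VG is flagged when its missing-PV count is above zero or when the
    fourth character of vg_attr is 'p' (partial: one or more PVs missing).
    """
    report = json.loads(vgs_json)
    degraded = []
    for section in report.get("report", []):
        for vg in section.get("vg", []):
            attr = vg.get("vg_attr", "")
            # LVM JSON reports encode numbers as strings by default.
            missing = int(vg.get("vg_missing_pv_count") or 0)
            partial = len(attr) >= 4 and attr[3] == "p"
            if missing > 0 or partial:
                degraded.append(vg["vg_name"])
    return degraded


if __name__ == "__main__":
    # Hypothetical sample report: vg1 has lost one physical volume.
    sample = json.dumps({"report": [{"vg": [
        {"vg_name": "vg1", "vg_attr": "wz-pn-", "vg_missing_pv_count": "1"},
        {"vg_name": "vg2", "vg_attr": "wz--n-", "vg_missing_pv_count": "0"},
    ]}]})
    print(find_degraded_vgs(sample))
```

      A check like this only sees what LVM itself reports. It would not replace the disk health information raised in the second question.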

              rhn-engineering-dmacpher Daniel Macpherson
              rhn-stor-cblum Chris Blum
              John Wilkins