Type: Epic
Summary: LVMO recover from failure
Resolution: Done
Priority: Normal
Work Type: Product / Portfolio Work
Progress: 0% To Do, 0% In Progress, 100% Done
Status: Backlog Refinement
Goal
Be able to recover from a node or disk failure when using LVMO
Problem
Currently we have no documentation on how to recover from failure events such as disk or node loss. We need to fix this.
Why is this important?
Nodes and disks eventually fail, and we need supported guidelines on how to recover from these events.
Dependencies
Prioritized Scenarios
In Scope
- Recovery for LVMO installed via ACM or manually via OperatorHub
- Recovery on node loss (for SNO with additional workers)
  - For any node loss (master or worker)
  - Excluding OCP-specific recovery steps (focus only on the storage layer)
- Recovery on disk loss (see the sketch after this list)
  - It is OK to throw away the whole LVM VG; data loss is expected
- In the end the whole cluster should be back in a usable state, where new PVs can be created and used on all nodes
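For the disk-loss case, a minimal sketch of what the LVM-layer cleanup could look like, assuming the failed PV is simply discarded and a replacement disk is added back to the volume group. The VG name and device path are placeholders, and the lvm2 commands (vgreduce --removemissing, pvcreate, vgextend) are standard upstream tools; whether LVMO-managed VGs need additional steps is exactly what this documentation should settle.

    import subprocess

    def run(cmd):
        # Echo each LVM command before running it so the recovery is auditable.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def recover_vg_after_disk_loss(vg_name, replacement_device):
        # Drop the missing PV and any LVs that lived on it; data on the
        # failed disk is lost by design, since LVMO has no replication.
        run(["vgreduce", "--removemissing", "--force", vg_name])
        # Initialize the replacement disk and add it to the VG so new
        # PVs can be provisioned again.
        run(["pvcreate", replacement_device])
        run(["vgextend", vg_name, replacement_device])

    if __name__ == "__main__":
        # Placeholder names; substitute the VG managed by LVMO and the new disk.
        recover_vg_after_disk_loss("vg1", "/dev/sdX")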
Not in Scope
- Data Recovery
Documentation Requirements
- Emphasize that LVMO itself provides no replication; data loss is therefore expected on disk or node loss
Customers
Customer Facing Story
As an administrator, I want to bring my SNO cluster back into a usable state after a failure event.
What does success look like?
On disk or node loss, we want to recover the cluster to a usable state without reinstalling the whole node (if avoidable).
Open Questions
- How much can we automate? Can we automatically detect a failed VG? Can we add a "button" somewhere that would auto-fix the LVM layer when the administrator detects a disk failure?
- Can we use the health info of disks to help administrators decide when a disk has failed? (See the sketch below.)
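On the health-info question, a sketch of how a script could surface volume-health signals today, assuming the CSI external health monitor described in the linked volume-health-monitor docs is deployed and reports abnormal volume conditions as Kubernetes events on the PVC. The namespace, PVC name, and event reason are illustrative assumptions.

    from kubernetes import client, config

    def abnormal_volume_events(namespace, pvc_name):
        # Collect events attached to a PVC that look like volume-health reports.
        config.load_kube_config()  # use load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        selector = (
            "involvedObject.kind=PersistentVolumeClaim,"
            f"involvedObject.name={pvc_name}"
        )
        events = v1.list_namespaced_event(namespace, field_selector=selector)
        # "VolumeConditionAbnormal" is the reason the external health monitor
        # is expected to use; verify against the deployed monitor version.
        return [e for e in events.items if e.reason == "VolumeConditionAbnormal"]

    if __name__ == "__main__":
        for e in abnormal_volume_events("default", "my-pvc"):
            print(e.last_timestamp, e.message)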
Links to: KNIP-1770 (https://kubernetes-csi.github.io/docs/volume-health-monitor.html)