Type: Epic
Summary: LVMO recover from failure
Resolution: Done
Priority: Normal
Work Type: Product / Portfolio Work
Progress: 0% To Do, 0% In Progress, 100% Done
Status: Backlog Refinement
Goal
Be able to recover from a node or disk failure when using LVMO
Problem
Currently we have no documentation on how to recover from failure events such as disk or node loss. We need to fix this.
Why is this important?
Nodes and disks eventually fail, and we need supported guidelines on how to recover from these events.
Dependencies
Prioritized Scenarios
In Scope
- Recovery for LVMO installed via ACM or manually via OperatorHub
- Recovery on node loss (for SNO with additional workers)
  - For any node loss (master or worker)
  - Excluding OCP-specific recovery steps (focus only on the storage layer)
- Recovery on disk loss (see the sketch after this list)
  - It is OK to throw away the whole LVM VG; data loss is expected
- In the end the whole cluster should be back in a usable state, where new PVs can be created and used on all nodes
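For the disk-loss case, a minimal sketch of what the LVM-layer cleanup could look like, assuming the failed PV is simply discarded and a replacement disk is added back to the volume group. The VG name and device path are placeholders, and the lvm2 commands (vgreduce --removemissing, pvcreate, vgextend) are standard upstream tools; whether LVMO-managed VGs need additional steps is exactly what this documentation should settle.

    import subprocess

    def run(cmd):
        # Echo each LVM command before running it so the recovery is auditable.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def recover_vg_after_disk_loss(vg_name, replacement_device):
        # Drop the missing PV and any LVs that lived on it; data on the
        # failed disk is lost by design, since LVMO has no replication.
        run(["vgreduce", "--removemissing", "--force", vg_name])
        # Initialize the replacement disk and add it to the VG so new
        # PVs can be provisioned again.
        run(["pvcreate", replacement_device])
        run(["vgextend", vg_name, replacement_device])

    if __name__ == "__main__":
        # Placeholder names; substitute the VG managed by LVMO and the new disk.
        recover_vg_after_disk_loss("vg1", "/dev/sdX")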
Not in Scope
- Data Recovery
Documentation Requirements
- Emphasize that LVMO itself provides no replication; data loss is therefore expected on disk or node loss
Customers
Customer Facing Story
As an administrator, I want to bring my SNO cluster back into a usable state after a failure event.
What does success look like?
On disk or node loss, we want to recover the cluster to a usable state without reinstalling the whole node (if avoidable).
Open Questions
- How much can we automate? Can we automatically detect a failed VG? Can we add a "button" somewhere that would auto-fix the LVM layer when the administrator detects a disk failure?
- Can we use the health info of disks to help administrators decide when a disk has failed? (See the sketch below.)
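On the health-info question, a sketch of how a script could surface volume-health signals today, assuming the CSI external health monitor described in the linked volume-health-monitor docs is deployed and reports abnormal volume conditions as Kubernetes events on the PVC. The namespace, PVC name, and event reason are illustrative assumptions.

    from kubernetes import client, config

    def abnormal_volume_events(namespace, pvc_name):
        # Collect events attached to a PVC that look like volume-health reports.
        config.load_kube_config()  # use load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        selector = (
            "involvedObject.kind=PersistentVolumeClaim,"
            f"involvedObject.name={pvc_name}"
        )
        events = v1.list_namespaced_event(namespace, field_selector=selector)
        # "VolumeConditionAbnormal" is the reason the external health monitor
        # is expected to use; verify against the deployed monitor version.
        return [e for e in events.items if e.reason == "VolumeConditionAbnormal"]

    if __name__ == "__main__":
        for e in abnormal_volume_events("default", "my-pvc"):
            print(e.last_timestamp, e.message)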
Links to: KNIP-1770 (https://kubernetes-csi.github.io/docs/volume-health-monitor.html)