Epic
Resolution: Done-Errata
Critical
None
Enable multi-node LVMS
Product / Portfolio Work
0% To Do, 0% In Progress, 100% Done
False
False
Green
M
None
OCP/Telco Definition of Done
Epic Goal
- Introduce a technically stable version of multi-node LVMS into our regular releases, and make sure it can run outside of our SNO configuration
Why is this important?
- Multi-node clusters are the norm in Kubernetes, and supporting them opens the solution up to the majority of standard K8s topologies
- Customers are already requesting Support Exceptions
Scenarios
- LVMS is deployed on multiple nodes, where each node is a single point of failure and is not highly available by default. The administrator of the nodes is responsible for providing HA storage that can be consumed by LVM volume groups, or the application developer ensures HA at the application level by using different volumes to replicate data across nodes.
- LVMS should be able to run stably in a multi-node environment, with the LVMCluster object being responsible for triggering a DaemonSet that deploys the VolumeGroup on every node (see the configuration sketch below)
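To make this scenario concrete, here is a minimal sketch of a multi-node LVMCluster resource, assuming the API shape published in the openshift/lvm-operator repository; the namespace, volume group name, and device path are illustrative assumptions, not prescribed values.
{code:yaml}
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: multi-node-lvmcluster
  namespace: openshift-storage      # assumed install namespace
spec:
  storage:
    deviceClasses:
      - name: vg1                   # becomes an LVM volume group on every selected node
        default: true
        deviceSelector:
          paths:
            - /dev/sdb              # hypothetical disk expected to exist on each node
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
{code}
In this model the operator's node-level DaemonSet picks up the device class on every matching node and creates the vg1 volume group there, which is exactly the behaviour a multi-node test pipeline needs to exercise.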
Acceptance Criteria
- CI - MUST be running successfully with tests automated - in particular, we will need a multi-node test pipeline that can verify the various edge cases around nodes becoming unavailable.
- We will have to define outage scenarios and how to properly recover from them within LVMS. In particular, we need to define what happens when an entire node falls out of the cluster and how the LVMCluster object is recovered afterwards.
- Release Technical Enablement - Provide necessary release enablement details and documents.
- The deployed TopoLVM instance needs to use these VolumeGroups as deviceClasses and use Kubernetes storage capacity tracking to correctly determine where to set up bindings based on CSI topology (illustrated in the sketch after this list)
- Once an LVMCluster is created, it should use the deviceSelector to correctly identify the devices that have to be initialized as physical volumes (PVs) on each node (see the LVMCluster sketch above). This is the basis for our configuration.
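As an illustration of the capacity-tracking criterion, the sketch below shows a topology-aware StorageClass and a PVC consuming it; the class name and provisioner follow the lvms-<deviceClass> / topolvm.io naming LVMS uses today, but treat the exact names and parameters as assumptions.
{code:yaml}
# StorageClass as LVMS would generate for the "vg1" device class (names assumed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvms-vg1
provisioner: topolvm.io
volumeBindingMode: WaitForFirstConsumer   # defer binding until the pod is scheduled so that
                                          # CSIStorageCapacity objects can steer node selection
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/fstype: xfs
---
# A claim against that class; the scheduler only binds it on a node whose
# volume group still reports enough free capacity for the request
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: lvms-vg1
  resources:
    requests:
      storage: 5Gi
{code}
With WaitForFirstConsumer, the scheduler consults the per-node CSIStorageCapacity objects published by the CSI driver before placing the consuming pod, which is the behaviour this acceptance criterion asks us to verify.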
Dependencies (internal and external)
- E2E test pipeline setup for multi-node deployment as the base for our test scaffolding (a sketch of such a job definition follows below)
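Purely as an illustration of what that dependency could look like in openshift/release, here is a sketch of a ci-operator test entry; the job name, cluster profile, workflow, and make target are assumptions and not the actual pipeline definition.
{code:yaml}
# Hypothetical ci-operator test entry for a multi-node LVMS e2e job
tests:
- as: e2e-aws-multi-node
  steps:
    cluster_profile: aws          # assumed cloud profile for a multi-node IPI cluster
    workflow: ipi-aws             # assumed install workflow
    test:
    - as: lvms-e2e
      from: src
      commands: make e2e          # assumed test entrypoint in the lvm-operator repo
      resources:
        requests:
          cpu: 100m
{code}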
Previous Work (Optional):
- …
Open questions:
- How do we cover all edge cases around multi-node failure scenarios?
- Are there any Status API changes that we should introduce to more clearly reflect the status of the VolumeGroups per node in the cluster? (A hypothetical sketch follows below.)
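To give the second question some shape, the following is a purely hypothetical per-node status layout; none of these field names are part of the current API, and they are only meant to frame the discussion.
{code:yaml}
# Hypothetical status layout for discussion only - not an existing API
status:
  deviceClassStatuses:
    - name: vg1
      nodeStatus:
        - node: worker-0          # hypothetical node name
          status: Ready
          devices:
            - /dev/sdb
        - node: worker-1
          status: Degraded        # e.g. node unreachable or volume group missing
          reason: NodeUnreachable
{code}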
Done Checklist
- CI - CI is running, tests are automated and merged. LVMS: https://issues.redhat.com/browse/OCPVE-588 openshift-release: https://github.com/openshift/release/pull/42554
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue> - See Linked https://issues.redhat.com/browse/OCPBUGS-13558 https://issues.redhat.com/browse/OCPBUGS-17852 https://issues.redhat.com/browse/OCPBUGS-17853
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue> - See https://github.com/openshift/lvm-operator/pull/384
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
is blocked by:
- OCPBUGS-13558 LVMS: After node loss, lvmCluster resource still has the entry for the lost node (Closed)
links to:
- RHBA-2024:126443 LVMS 4.15 Bug Fix and Enhancement update
mentioned on:
(4 links to, 1 mentioned on)