Epic
Resolution: Done-Errata
Critical
None
Enable multi-node LVMS
Product / Portfolio Work
0% To Do, 0% In Progress, 100% Done
False
False
Green
M
None
OCP/Telco Definition of Done
Epic Goal
- Introduce a technically stable version of multi-node LVMS into our regular releases, and make sure it can run outside of our SNO configuration
Why is this important?
- Multi-node clusters are the norm in Kubernetes, and supporting them opens the solution up to the majority of standard K8s topologies
- Customers are already requesting Support Exceptions
Scenarios
- LVMS is deployed on multiple nodes, where each node is a single point of failure and is not highly available by default. The administrator of the nodes is responsible for providing HA storage that can be consumed by LVM volume groups, or the application developer ensures HA at the application level by using different volumes to replicate data across nodes.
- LVMS should be able to run stably in a multi-node environment, with the LVMCluster object being responsible for triggering a DaemonSet that deploys the VolumeGroup on every node (see the configuration sketch below)
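To make this scenario concrete, here is a minimal sketch of a multi-node LVMCluster resource, assuming the API shape published in the openshift/lvm-operator repository; the namespace, volume group name, and device path are illustrative assumptions, not prescribed values.
{code:yaml}
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: multi-node-lvmcluster
  namespace: openshift-storage      # assumed install namespace
spec:
  storage:
    deviceClasses:
      - name: vg1                   # becomes an LVM volume group on every selected node
        default: true
        deviceSelector:
          paths:
            - /dev/sdb              # hypothetical disk expected to exist on each node
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
{code}
In this model the operator's node-level DaemonSet picks up the device class on every matching node and creates the vg1 volume group there, which is exactly the behaviour a multi-node test pipeline needs to exercise.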
Acceptance Criteria
- CI - MUST be running successfully with tests automated - in particular, we will need a multi-node test pipeline that can verify the various edge cases around nodes becoming unavailable.
- We will have to define outage scenarios and how to properly recover from them within LVMS. In particular, we need to define what happens when an entire node falls out of the cluster and how the LVMCluster object is recovered afterwards.
- Release Technical Enablement - Provide necessary release enablement details and documents.
- The deployed TopoLVM instance needs to use these VolumeGroups as deviceClasses and use Kubernetes storage capacity tracking to correctly determine where to set up bindings based on CSI topology (illustrated in the sketch after this list)
- Once an LVMCluster is created, it should use the deviceSelector to correctly identify the devices that have to be initialized as physical volumes (PVs) on each node (see the LVMCluster sketch above). This is the basis for our configuration.
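As an illustration of the capacity-tracking criterion, the sketch below shows a topology-aware StorageClass and a PVC consuming it; the class name and provisioner follow the lvms-<deviceClass> / topolvm.io naming LVMS uses today, but treat the exact names and parameters as assumptions.
{code:yaml}
# StorageClass as LVMS would generate for the "vg1" device class (names assumed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvms-vg1
provisioner: topolvm.io
volumeBindingMode: WaitForFirstConsumer   # defer binding until the pod is scheduled so that
                                          # CSIStorageCapacity objects can steer node selection
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/fstype: xfs
---
# A claim against that class; the scheduler only binds it on a node whose
# volume group still reports enough free capacity for the request
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: lvms-vg1
  resources:
    requests:
      storage: 5Gi
{code}
With WaitForFirstConsumer, the scheduler consults the per-node CSIStorageCapacity objects published by the CSI driver before placing the consuming pod, which is the behaviour this acceptance criterion asks us to verify.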
Dependencies (internal and external)
- E2E test pipeline setup for multi-node deployment as the base for our test scaffolding (a sketch of such a job definition follows below)
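Purely as an illustration of what that dependency could look like in openshift/release, here is a sketch of a ci-operator test entry; the job name, cluster profile, workflow, and make target are assumptions and not the actual pipeline definition.
{code:yaml}
# Hypothetical ci-operator test entry for a multi-node LVMS e2e job
tests:
- as: e2e-aws-multi-node
  steps:
    cluster_profile: aws          # assumed cloud profile for a multi-node IPI cluster
    workflow: ipi-aws             # assumed install workflow
    test:
    - as: lvms-e2e
      from: src
      commands: make e2e          # assumed test entrypoint in the lvm-operator repo
      resources:
        requests:
          cpu: 100m
{code}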
Previous Work (Optional):
- …
Open questions:
- How do we cover all edge cases around multi-node failure scenarios?
- Are there any Status API changes that we should introduce to more clearly reflect the status of the VolumeGroups per node in the cluster? (A hypothetical sketch follows below.)
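To give the second question some shape, the following is a purely hypothetical per-node status layout; none of these field names are part of the current API, and they are only meant to frame the discussion.
{code:yaml}
# Hypothetical status layout for discussion only - not an existing API
status:
  deviceClassStatuses:
    - name: vg1
      nodeStatus:
        - node: worker-0          # hypothetical node name
          status: Ready
          devices:
            - /dev/sdb
        - node: worker-1
          status: Degraded        # e.g. node unreachable or volume group missing
          reason: NodeUnreachable
{code}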
Done Checklist
- CI - CI is running, tests are automated and merged. LVMS: https://issues.redhat.com/browse/OCPVE-588 openshift-release: https://github.com/openshift/release/pull/42554
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue> - See Linked https://issues.redhat.com/browse/OCPBUGS-13558 https://issues.redhat.com/browse/OCPBUGS-17852 https://issues.redhat.com/browse/OCPBUGS-17853
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue> - See https://github.com/openshift/lvm-operator/pull/384
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
is blocked by:
- OCPBUGS-13558 LVMS: After node loss, lvmCluster resource still has the entry for the lost node (Closed)
links to:
- RHBA-2024:126443 LVMS 4.15 Bug Fix and Enhancement update
mentioned on:
(4 links to, 1 mentioned on)