Epic
Resolution: Done
Critical
Recovery Validation for Spanned Clusters
To Do
Product / Portfolio Work
0% To Do, 0% In Progress, 100% Done
Epic Goal*
To support 4- and 5-node control-plane architectures on baremetal, we need periodic CI jobs similar to the existing blocking and informing jobs for metal, but running with 5 control-plane nodes.
E.g., see the jobs in the release payload status:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-07-29-061317
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ovn-assisted/1817807018258337792
Additionally, we need to validate the manual disaster recovery steps for 5-node control-plane clusters and document any additional steps needed to recover an assisted-installer baremetal cluster from quorum-loss failure states.
E.g., https://github.com/openshift/assisted-service/blob/master/docs/user-guide/day2-master/411-healthy.md
Why is this important? (mandatory)
See https://issues.redhat.com/browse/OCPSTRAT-1199 for background on the use cases for spanned clusters to improve resiliency.
Scenarios (mandatory)
See https://docs.google.com/presentation/d/1acJEeGdktwDIVMSdQne6KQD5DJxyx9CmOn8MMadDs9Q for failure states across two domains.
The main case for which we need to test the recovery steps is a 3 + 2 control-plane node configuration across two failure domains, where losing the domain that holds the majority of the nodes results in quorum loss.
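To make the 3 + 2 case concrete, here is a minimal sketch of the majority-quorum arithmetic that etcd (Raft) relies on: a cluster of n members needs floor(n/2) + 1 healthy members to keep quorum. The function names and the failure-domain labels (`domain-a`, `domain-b`) are hypothetical and only illustrate the scenario; this is not part of any recovery tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(total_members: int, healthy_members: int) -> bool:
    """True if the healthy members still form a majority of the cluster."""
    return healthy_members >= quorum_size(total_members)


def quorum_after_domain_loss(domain_sizes: dict[str, int]) -> dict[str, bool]:
    """For each failure domain, report whether quorum survives losing it entirely."""
    total = sum(domain_sizes.values())
    return {
        lost: has_quorum(total, total - size)
        for lost, size in domain_sizes.items()
    }


if __name__ == "__main__":
    # Hypothetical labels for the two failure domains in the 3 + 2 layout.
    layout = {"domain-a": 3, "domain-b": 2}
    for lost, ok in quorum_after_domain_loss(layout).items():
        state = "quorum kept" if ok else "quorum lost"
        print(f"lose {lost} ({layout[lost]} nodes): {state}")
```

With 5 members the quorum size is 3, so losing the 2-node domain leaves 3 healthy members and the cluster stays writable, while losing the 3-node domain leaves only 2 and requires the manual recovery steps this epic is meant to validate.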
Dependencies (internal and external) (mandatory)
None
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
- Development - etcd
- Documentation - etcd docs team
- QE -
- PX -
- Others -
Acceptance Criteria (optional)
A document that outlines the steps for recovering a 5-node control plane spread across two failure domains from a quorum loss scenario.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
- CI Testing - Basic e2e automation tests are merged and completing successfully
- Documentation - Content development is complete.
- QE - Test scenarios are written and executed successfully.
- Technical Enablement - Slides are complete (if requested by PLM)
- Engineering Stories Merged
- All associated work items with the Epic are closed
- Epic status should be “Release Pending”
- links to