Feature
Resolution: Done
Critical
None
BU Product Work
False
False
0% To Do, 0% In Progress, 100% Done
M
0
Program Call
This is a big change from the long-standing support policy/stance for clusters; a TE should be provided for awareness.
Feature Overview (aka. Goal Summary)
Customers with hard requirements for active-active deployments across two locations, who need to support stateful traditional applications (e.g., OCP Virt VMs that can only run a single instance), depend on the underlying infrastructure to provide that availability. These use cases are common when deploying the VMs on traditional virtualization stacks. To support those scenarios, an OpenShift cluster is deployed as a stretched or spanned cluster with a control-plane distribution of 2+1 or 1+1+1 (when using an arbiter site).
During failure scenarios in the data center hosting the majority of the control plane nodes, the surviving control plane node becomes the only node with the latest configuration and state of all the objects/resources in the cluster. The recovery procedure for this configuration in a disaster scenario requires the single surviving node to become read-write while holding the only copy of etcd. Should that node also fail, the failure is catastrophic. This is even more critical when OCP Virt is also hosting the stateful VMs.
To increase resiliency and reduce risk during this type of failure, we need to extend the number of control plane nodes to support 2+2 and 3+2 deployments. In these layouts, a failure of the site hosting the majority of the nodes still leaves two read-only copies of etcd in the surviving location, providing higher assurance for the recoverability of the cluster.
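To make the quorum arithmetic behind these layouts explicit, the minimal Python sketch below (the layout names and site splits are illustrative only) computes the etcd quorum size, which is a simple majority of members, and what survives when the larger site is lost:

# Illustrative quorum arithmetic for stretched control-plane layouts.
# etcd stays writable only while a majority (floor(n/2) + 1) of members is up.
def quorum(members: int) -> int:
    return members // 2 + 1

# (total members, members in the larger site) for the layouts discussed above
layouts = {"2+1": (3, 2), "1+1+1": (3, 1), "2+2": (4, 2), "3+2": (5, 3)}

for name, (total, larger_site) in layouts.items():
    surviving = total - larger_site
    writable = surviving >= quorum(total)
    print(f"{name}: {surviving}/{total} members survive a larger-site failure, "
          f"quorum is {quorum(total)}, cluster stays writable: {writable}")

In the 2+2 and 3+2 cases the surviving site cannot form quorum, so etcd becomes read-only, but two copies of the data remain, which is the resiliency gain described above.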
Today, the cluster-etcd-operator can handle up to 5 etcd instances when it detects up to 5 control plane nodes. This capability is used as part of the automation for vertical scaling of the control plane in environments with a control-plane MachineSet. For deployments where MachineSets are not available (e.g., bare metal, agent-based installer), the cluster-etcd-operator is not automatically triggered to vertically scale the control plane, but it will scale the etcd peers if control-plane nodes are added to the environment manually. This is the procedure we want to validate and support for bare-metal clusters with stretched control planes.
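As a rough illustration of how that manual scale-out could be verified, the following sketch (assuming the oc CLI is installed and logged in; the openshift-etcd namespace, the app=etcd pod label, and the node-role.kubernetes.io/master node label are the commonly used selectors and may vary by version) compares the number of control-plane nodes to the etcd membership reported by one of the etcd pods:

# Minimal sketch: after manually adding control-plane nodes, check that the
# etcd membership managed by cluster-etcd-operator has scaled to match.
# Assumes the `oc` CLI is on PATH and authenticated against the cluster.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Control-plane nodes currently registered with the cluster
nodes = run(["oc", "get", "nodes", "-l", "node-role.kubernetes.io/master",
             "-o", "name"]).split()
print(f"control-plane nodes: {len(nodes)}")

# etcd pods managed by cluster-etcd-operator
pods = run(["oc", "-n", "openshift-etcd", "get", "pods", "-l", "app=etcd",
            "-o", "name"]).split()
print(f"etcd pods: {len(pods)}")

# Ask one etcd member for the full member list
if pods:
    pod = pods[0].removeprefix("pod/")
    print(run(["oc", "-n", "openshift-etcd", "rsh", pod,
               "etcdctl", "member", "list", "-w", "table"]))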
This feature is only for bare metal and mainly for OCP Virtualization use cases.
Goals (aka. expected user outcomes)
- Validate and support the use of 4-node and 5-node control-plane architectures for bare-metal clusters in stretched control-plane configurations, with the following restrictions:
- bare-metal control plane
- bare-metal deployment using assisted-service or agent-based installer
- Same Layer 3 network across locations
- Max latency across nodes < 10 ms (a rough spot-check sketch follows this list)
- Min bandwidth 10 Gbps across nodes
- etcd must be on an SSD or NVMe disk
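The latency restriction can be spot-checked before install; the minimal Python sketch below (the peer address is a placeholder for a control-plane host at the other site, and port 6443 is used only because the API server is expected to listen there) times TCP connections to a peer as a rough proxy for round-trip latency. Disk suitability for etcd is usually validated separately with a synthetic fsync benchmark.

# Minimal sketch: rough spot-check of the <10 ms inter-node latency requirement
# by timing TCP connection setup from one control-plane host to a peer.
import socket
import statistics
import time

PEER = "192.0.2.10"   # placeholder: a control-plane node at the other site
PORT = 6443           # API server port, assumed reachable across sites
SAMPLES = 20

rtts = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with socket.create_connection((PEER, PORT), timeout=2):
        pass
    rtts.append((time.perf_counter() - start) * 1000)  # milliseconds
    time.sleep(0.2)

print(f"median connect time: {statistics.median(rtts):.2f} ms "
      f"(max {max(rtts):.2f} ms over {SAMPLES} samples)")

A TCP connect includes handshake overhead, so the values slightly overestimate the raw network round trip; treat the result as an upper bound rather than a precise measurement.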
Requirements (aka. Acceptance Criteria):
- Performance and scaling should show minimal (<10%) degradation compared to performance tests on existing HA clusters
- Validate and update the documentation on manual control-plane recovery procedures in case of quorum loss
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the configuration to be supported in the future) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | multi-node |
Connected / Restricted Network | N/A |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 |
Operator compatibility | N/A |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | unknown |
Other (please specify) | Observability (Update to Prometheus rules for control-plane) |
Use Cases
The only use case under consideration is a standard multi-node bare-metal deployment with a stretched control plane, installed with the assisted-service or agent-based installer.
Out of Scope
Any other use case or installation mode.
Documentation Considerations
The documentation must include a clear step-by-step validated recovery procedure for quorum loss.
- relates to
  - RFE-5311 Allow configuration of Hosted Control Plane component replicas (Backlog)
  - RFE-5310 Allow providing Agent labels from BareMetalHost object (Accepted)
  - OCPSTRAT-539 Enhance recovery procedure for full control plane failure (In Progress)
  - OCPSTRAT-1219 Allow 5-node control planes in day 1 with Agent-Based Installer (In Progress)
  - OCPSTRAT-1395 Automated control-plane recovery from expired certificates (hibernation) (In Progress)