[OCPSTRAT-1543] Support 2+1 node OpenShift cluster with Remote Arbiter node (ORA) - Red Hat Issue Tracker

Type: Feature
Resolution: Won't Do
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Component/s: etcd
Labels:

Work Type: Strategic Portfolio Work
Blocked: False
Blocked Reason: None
Ready: False
Parent Link: OCPSTRAT-1542 Two Node OpenShift topologies for edge customers
Risk Score: 0

Discussion Needed:

Program Call

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

Edge customers that need on-site computing for business applications (e.g., point of sale, security and control applications, AI inference) are asking for a two-node HA solution for their environments. They want only two nodes at the edge because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced: it supports a small, inexpensive arbiter node that is remote or virtual, which reduces on-site hardware cost.

Goals (aka. expected user outcomes)

Support OpenShift on a 2+1 topology: two primary nodes with large capacity that run the workload and the control plane, and a third, small, remote “arbiter” node that ensures quorum. See the requirements for more details.

Requirements (aka. Acceptance Criteria):

Remote arbiter node: a third node running in a remote location (datacenter, cloud). The maximum supported network latency needs to be clearly documented. Goal: up to 500 ms round-trip time (RTT), i.e. 250 ms one way. The remote arbiter should consume as few resources as possible to allow for large scale (e.g. using Hosted Control Plane, containerized control plane).
The arbiter node can be a containerized or virtual host.
OCP Virt is fully functional, incl. live migration of VMs (assuming an RWX CSI driver is available).
A single-node outage is handled seamlessly.
If the remote arbiter node is down or not reachable, a reboot/restart of the two remaining on-site nodes has to work, i.e. the two remaining nodes regain quorum and spin up the workload.
In double-failure mode, i.e. the connection to the remote arbiter AND one node are lost, standard “loss of quorum” behaviour is expected (read-only etcd, no cluster state changes, workload keeps running as long as it is stable).
Scaling out the cluster by adding worker nodes should be possible.
Transitioning the cluster into a regular three-node compact cluster should be possible, e.g. by adding a new node as a control plane node and then removing the arbiter (witness) node.
Regular workload should not be scheduled to the remote arbiter node. Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (i.e. router, registry, console, monitoring etc.) should also not be scheduled to the arbiter node.
It must be possible to explicitly schedule additional workload to the arbiter node. That is important for third-party solutions (e.g. storage providers) that also have quorum-based mechanisms.
Must integrate seamlessly into existing installation/update mechanisms, esp. zero-touch provisioning etc.
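The failure-handling requirements above follow from majority quorum in a three-member etcd cluster. The following is a minimal sketch of that arithmetic only, not of OpenShift or etcd code; the member names are hypothetical and chosen for illustration.

```python
from typing import Iterable

# Hypothetical member names for a 2+1 cluster: two on-site nodes plus the arbiter.
MEMBERS = ("node-a", "node-b", "arbiter")


def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def etcd_state(failed: Iterable[str]) -> str:
    """Return 'quorum' if a majority of MEMBERS is still reachable, else 'quorum lost'."""
    failed_set = set(failed)
    reachable = [m for m in MEMBERS if m not in failed_set]
    if len(reachable) >= quorum_size(len(MEMBERS)):
        return "quorum"
    return "quorum lost"


if __name__ == "__main__":
    scenarios = ([], ["node-a"], ["arbiter"], ["arbiter", "node-b"])
    for failed in scenarios:
        print(failed, "->", etcd_state(failed))
```

With three members the majority is two, so losing either one on-site node or the arbiter keeps quorum, while losing the arbiter and one on-site node does not. It also shows why a restart with the arbiter unreachable needs both on-site nodes back up before quorum is regained.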

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	self-managed
Classic (standalone cluster)	yes
Hosted control planes	no
Multi node, Compact (three node), or Single node (SNO), or all	Multi node and Compact (three node)
Connected / Restricted Network	both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	x86_64 and ARM
Operator compatibility	full
Backport needed (list applicable versions)	no
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	no
Other (please specify)	n/a

Questions to Answer (Optional):

How to implement the scheduling restrictions to the arbiter node? New node role “arbiter”?
Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA?
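One possible answer to the scheduling question is a taint on the arbiter node that only explicitly tolerating workloads can pass. The following is a simplified model of Kubernetes taint/toleration matching, written as a sketch to reason about the requirement. The taint key `node-role.kubernetes.io/arbiter` is an assumption for illustration, not something this ticket has decided.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # e.g. "NoSchedule"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any taint effect


# Assumed taint for the arbiter node (hypothetical key, see lead-in).
ARBITER_TAINT = Taint(key="node-role.kubernetes.io/arbiter", effect="NoSchedule")


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def can_schedule(node_taints: Sequence[Taint], tolerations: Sequence[Toleration]) -> bool:
    """A pod fits only if every scheduling-relevant taint on the node is tolerated."""
    blocking = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


if __name__ == "__main__":
    regular_pod = []  # no tolerations: kept off the arbiter
    quorum_pod = [Toleration(key=ARBITER_TAINT.key, operator="Exists", effect="NoSchedule")]
    print(can_schedule([ARBITER_TAINT], regular_pod))
    print(can_schedule([ARBITER_TAINT], quorum_pod))
```

Under this model, regular and non-essential control plane workload stays off the arbiter by default, while a third-party component with its own quorum mechanism can opt in by adding a toleration.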

Out of Scope

Storage driver providing RWX shared storage
…

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Two-node support is in high demand among telco, industrial, and retail customers.
VMware supports a two-node vSAN solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
Example hardware frequently used for edge deployments with a co-located small arbiter node: the Dell PowerEdge XR4000z server, an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads.

Customer Considerations

See requirements. There are two main groups of customers: those with a co-located arbiter node and those with a remote arbiter node.

Documentation Considerations

The topology needs to be documented, esp. the requirements of the arbiter node.

Interoperability Considerations

OCP Virt needs to be explicitly tested in this scenario to support VM HA (live migration, restart on another node).

clones

OCPSTRAT-1500 Two Node OpenShift with Arbiter (TNA) - Tech Preview

Release Pending

Assignee: Daniel Fröhlich

Reporter: William Caban

Contributors: Daniel Fröhlich

Doc Contact: Matthew Werner

Votes: 0

Watchers: 8

Created: 2024/07/29 1:48 PM

Updated: 2025/03/25 11:14 PM

Resolved: 2025/03/25 3:06 PM
