Type: Feature
Resolution: Unresolved
Priority: Critical
Category: Product / Portfolio Work
Progress: 0% To Do, 0% In Progress, 100% Done
Size: XL
Feature Overview (aka. Goal Summary)
Customers with large numbers of geographically dispersed locations want a container management solution with a two-node footprint. They require high availability, but even "cheap" third nodes represent a significant cost at this scale.
Goals (aka. expected user outcomes)
Two-node clustering is a solved problem in the traditional HA space. The goal of this feature is to introduce existing RHEL technologies into OpenShift to support a true two-node topology. This requires fencing to ensure safe node recovery; hence the name: Two Node OpenShift with Fencing (TNF).
Requirements (aka. Acceptance Criteria):
- Provide a true two-node OCP deployment.
- Support workloads in active/passive mode, i.e. a single-instance pod where the pods from the failed node are restarted on the second node in a timely manner, or a second pod is already running but passive, ready to take over if the first pod fails (e.g. a PostgreSQL database in an active/passive setup). This keeps CPU utilisation at ~50% max.
- Support workloads in active/active mode: both nodes are load sharing and are loaded by design to about 60-75% at full capacity. During a failure there is an expectation of service degradation, but not of the service being down completely: if one node fails, the other node operates at close to 100%.
- Either both nodes have a fencing device (BMC via Redfish, IPMI, etc., or a UPS via a serial port), or there is a dedicated direct crossover cable between the nodes to drastically reduce the risk of split brain. BMC via Redfish at TP only; other fencing devices probably post-GA (see the Redfish sketch after this list).
- <60s failover time: if the leading node goes down, the remaining node takes over and reaches an operational (writable) state in less than 60 seconds. The exact parameters (heartbeat interval, missed heartbeats, etc.) need to be configurable by users, e.g. to operate on a less aggressive timeline if required and avoid unnecessary failovers due to blips/flukes. (To be refined after initial numbers are observed during TP testing; see the failover budget sketch after this list.)
- Must not require shared storage between the nodes as a fencing device (none is assumed to be available).
- Be able to scale out to a true three-node compact cluster as a day-2 operation (stretch goal, not required for MVP, but a constraint to keep in mind during design and implementation). The resulting cluster should have a 3-node etcd quorum and the same architecture/support statement as a freshly installed three-node compact cluster. Out of scope for TP, and probably even for GA, as OCP currently does not support control plane topology changes.
- Be able to add worker nodes to a two-node cluster with fencing as a day-2 operation, like we support with SNO + worker nodes (stretch goal, not required for TP or GA).
- The solution fulfills the k8s-etcd contract (https://docs.google.com/document/d/1NUZDiJeiIH5vo_FMaTWf0JtrQKCx0kpEaIIuPoj9P6A/edit#heading=h.tlkin1a8b8bl), so that layered mechanisms like Leases work correctly (see the Lease sketch after this list).
- Support full recovery of the workload when the node comes back online after restoration; total time <15 minutes.
- x86_64 only in the initial release; aarch64 might be added later.
- Added: ability to track TNF usage in the fleet of connected clusters via OCP telemetry data (e.g. the number of clusters with TNF topology).
- Added: be able to install OCP Virt and run VMs with node-local storage (e.g. LSO or LVMS) on both nodes. Deferred to GA.
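For illustration, here is a minimal sketch of the kind of Redfish call a fence agent issues to force-off a failed peer. The BMC address, system path, and credentials are hypothetical; in TNF the actual fencing would be driven by the RHEL HA stack rather than custom code like this.

```python
# Sketch: force-off a failed peer via the standard Redfish
# ComputerSystem.Reset action. All endpoint details are hypothetical.
import requests

BMC = "https://bmc-node2.example.com"     # hypothetical BMC address
SYSTEM = "/redfish/v1/Systems/1"          # system resource on the BMC

def fence_node(username: str, password: str) -> None:
    """Power off the peer immediately (ForceOff, not a graceful shutdown)."""
    resp = requests.post(
        f"{BMC}{SYSTEM}/Actions/ComputerSystem.Reset",
        json={"ResetType": "ForceOff"},
        auth=(username, password),
        verify=False,                     # many BMCs ship self-signed certs
        timeout=10,
    )
    resp.raise_for_status()
```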
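The <60s failover requirement can be reasoned about as a budget across failure detection, fencing, and etcd promotion. The sketch below uses illustrative parameter names and values, not actual TNF settings, to show why the heartbeat interval and missed-heartbeat count must stay user-tunable.

```python
# Back-of-the-envelope failover budget for the <60s requirement.
# All values are illustrative assumptions, not measured TNF numbers.
heartbeat_interval_s = 1.0   # how often the peers exchange heartbeats
missed_heartbeats = 5        # misses before the peer is declared dead
fencing_time_s = 20.0        # worst-case BMC ForceOff round trip
etcd_promotion_s = 15.0      # surviving node regains a writable etcd

detection_s = heartbeat_interval_s * missed_heartbeats
total_s = detection_s + fencing_time_s + etcd_promotion_s
print(f"worst-case failover: {total_s:.0f}s (budget: 60s)")
assert total_s < 60, "tune heartbeat interval / miss count to stay in budget"
```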
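One observable consequence of the k8s-etcd contract holding is that node heartbeat Leases keep renewing after a failover. A hedged check using the official kubernetes Python client (assuming a reachable kubeconfig) might look like this:

```python
# Sketch: flag node Leases that stopped renewing after a failover.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
leases = client.CoordinationV1Api().list_namespaced_lease("kube-node-lease")
now = datetime.now(timezone.utc)
for lease in leases.items:
    if lease.spec.renew_time is None:
        continue
    age = (now - lease.spec.renew_time).total_seconds()
    status = "STALE" if age > lease.spec.lease_duration_seconds else "ok"
    print(f"{lease.metadata.name}: renewed {age:.0f}s ago [{status}]")
```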
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self managed |
Classic (standalone cluster) | yes |
Hosted control planes | n/a |
Multi node, Compact (three node), or Single node (SNO), or all | NEW: Two Node with Fencing |
Connected / Restricted Network | both |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64; ARM (aarch64) possibly later |
Operator compatibility | full |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | none |
Other (please specify) | |
Questions to Answer (Optional):
- ...
 Out of Scope
- Storage driver providing RWX shared storage
- ...
 Background
- Two-node support is in high demand among telco, industrial, and retail customers.
- StarlingX supports a true two-node topology (docs)
 Customer Considerations
Telco Customer requirements:
2-node HA control-plane requirements for Telco
Documentation Considerations
The topology needs to be documented, esp. the requirements of the arbiter node.
Interoperability Considerations
- OCP Virt needs to be explicitly tested in this scenario to support VM HA (live migration, restart on the other node); see the sketch below.
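As a sketch of what such a test could check: KubeVirt exposes VirtualMachineInstances as kubevirt.io/v1 custom resources, so after failing one node a test can confirm each VMI is Running again on the surviving node. The namespace below is hypothetical.

```python
# Sketch: list VMIs and the node they run on after a failover drill.
from kubernetes import client, config

config.load_kube_config()
vmis = client.CustomObjectsApi().list_namespaced_custom_object(
    group="kubevirt.io", version="v1",
    namespace="vms", plural="virtualmachineinstances",
)
for vmi in vmis["items"]:
    name = vmi["metadata"]["name"]
    phase = vmi.get("status", {}).get("phase")
    node = vmi.get("status", {}).get("nodeName")
    print(f"{name}: {phase} on {node}")   # expect Running on the survivor
```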
Issue links:
- is cloned by: OCPSTRAT-1551 Two Node OpenShift with Fencing (TNF) - GA (In Progress)