Type: Feature
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: etcd
Labels:
- edge
- ocpedge-plan

Work Type:
Strategic Portfolio Work
Blocked:
False
Blocked Reason:

None
Ready:
False
Parent Link:
OCPSTRAT-1542 Two Node OpenShift topologies for edge customers
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done

Risk Score:
0

Discussion Needed:

Program Call

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

Customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. Due to the large number of deployments, they need an automated mechanism to recover quorum, prioritizing service uptime for workloads over data loss. These customers want to avoid investing in a “witness” node, which increases their cost and has identified trade-offs.

The trade-offs differ across use cases, which is the reason for defining "heuristic profiles".

Goals (aka. expected user outcomes)

In general, the customers requiring this capability are looking for a 2-node Kubernetes cluster capable of automatically promoting a surviving node to recover an etcd quorum even if some data and state are lost.

The following constraints and trade-offs are considered for the design.

Minimum operating environment

2-node infrastructure in the same site
Minimum 16 Cores per node
Minimum 32GB RAM per node
Minimum one 200GB SSD or M.2 storage device per node
Minimum 2x 10GbE network ports

Nodes with static network definitions
Local storage only (e.g., OCP LSO, OCP LVM, hostPath)
No external storage dependencies
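
As an illustration only, the minimums above could be checked per node with a small validation helper. This is a sketch under stated assumptions: `NodeSpec`, its field names, and `unmet_minimums` are hypothetical and do not correspond to any OpenShift or installer API; only the threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one node; field names are assumptions."""
    cores: int
    ram_gb: int
    largest_ssd_gb: int      # size of the largest local SSD or M.2 device
    ten_gbe_ports: int       # number of 10GbE network ports
    static_network: bool     # node uses a static network definition
    external_storage: bool   # node depends on storage outside the cluster


def unmet_minimums(spec: NodeSpec) -> list[str]:
    """Return the minimum requirements that the node does not satisfy."""
    checks = [
        (spec.cores >= 16, "at least 16 cores"),
        (spec.ram_gb >= 32, "at least 32GB RAM"),
        (spec.largest_ssd_gb >= 200, "one 200GB SSD or M.2 device"),
        (spec.ten_gbe_ports >= 2, "two 10GbE network ports"),
        (spec.static_network, "static network definition"),
        (not spec.external_storage, "no external storage dependencies"),
    ]
    return [label for ok, label in checks if not ok]
```

An empty result means the node meets every listed minimum; the "same site" requirement applies to the pair of nodes and is not covered by this per-node check.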

Accepted Constraints

The cluster will host stateless apps and autonomous stateful apps
The autonomous stateful apps are capable of automatic recovery from data loss or split-brain scenarios

The cluster will NOT attempt additional or special-case remediations for stateful apps beyond the default Kubernetes behavior

Accepted Trade-offs

The customer values uptime over data loss
- The customer is okay with losing data and the state of the past 15 minutes
- After recovery of a network split, all the new etcd data for one of the nodes will be lost
The customer is aware of and accepts certain data loss as long as the applications return to an operational state within a specific RTO.
The RPO is >15 minutes
Quorum recovery on network split will cause two data loss events

Requirements (aka. Acceptance Criteria):

Functional & Operational Requirements

Predefined selectable heuristics to select which of the two etcd databases to keep (see Heuristic Profiles)
In case of a single node outage (e.g., due to failed HW), the surviving node must be able to reboot and restart the workload
Recovery without manual interaction (manual intervention is not possible at the edge)
Day 0 operations: can be installed using regular edge installation mechanisms, e.g., ZTP workflows
Day 1 operations: a failed node can be easily replaced

Heuristic Profiles for designation of new “primary node” and which etcd data to discard:

Static designation: using a configuration parameter to designate the primary node (e.g., label, annotation, etc.)
Node uptime: after recovering from a network partition, the node with the highest uptime is designated as the primary node.
Last save wins: after recovering from a network partition, the node with the newest entry / last update is designated as the primary node.
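
The three profiles can be sketched as a single selection function. This is a minimal illustration, not the design for this Feature: `Profile`, `NodeState`, `pick_primary`, and all field names are hypothetical, `last_write_revision` is an assumed stand-in for "newest entry / last update", and the tie-break on node name is an assumption the ticket does not specify.

```python
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    STATIC = "static"
    NODE_UPTIME = "node-uptime"
    LAST_SAVE_WINS = "last-save-wins"


@dataclass(frozen=True)
class NodeState:
    """Hypothetical view of one node after the partition heals."""
    name: str
    designated_primary: bool   # assumed to come from a label or annotation
    uptime_seconds: float
    last_write_revision: int   # assumed proxy for the newest etcd update


def pick_primary(profile: Profile, a: NodeState, b: NodeState) -> NodeState:
    """Return the node whose etcd data is kept; the other node's is discarded."""
    if profile is Profile.STATIC:
        flagged = [n for n in (a, b) if n.designated_primary]
        if len(flagged) != 1:
            raise ValueError("exactly one node must be designated as primary")
        return flagged[0]

    if profile is Profile.NODE_UPTIME:
        key = lambda n: n.uptime_seconds
    else:  # Profile.LAST_SAVE_WINS
        key = lambda n: n.last_write_revision

    if key(a) == key(b):
        # Assumed tie-break: choose deterministically by node name.
        return min(a, b, key=lambda n: n.name)
    return max(a, b, key=key)
```

Whichever profile is selected, both nodes must evaluate it identically so that they agree on which database to discard, which is why the sketch uses only deterministic inputs.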

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT for the configuration to be supported in the future as well.

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	self-managed
Classic (standalone cluster)	yes
Hosted control planes	no
Multi node, Compact (three node), or Single node (SNO), or all	no (this is a new 2-node topology)
Connected / Restricted Network	both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	x86_64 and ARM
Operator compatibility	n/a
Backport needed (list applicable versions)	n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	n/a
Other (please specify)	n/a

Use Cases (Optional):

Refer to PRD doc for details (Red Hat internal) https://docs.google.com/document/d/1eTAeRObascRXgGfaxzxvdQrGxDQpEYhxdMtl-FuANKU/edit?usp=sharing

Questions to Answer (Optional):

~~Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.~~

~~<your text here>~~

Out of Scope

~~High-level list of items that are out of scope. Initial completion during Refinement status.~~

~~<your text here>~~

Background

~~Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.~~

~~<your text here>~~

Customer Considerations

~~Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.~~

~~<your text here>~~

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

~~<your text here>~~

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

~~<your text here>~~

is cloned by

OCPSTRAT-1500 Two Node OpenShift with Arbiter (TNA) - Tech Preview

In Progress

is depended on by

OCPSTRAT-1311 Allow 2-node control planes in day 1 with Agent-Based Installer

Assignee: William Caban

Reporter: William Caban

Need Info From: Daniel Fröhlich

Contributors: Daniel Fröhlich

Doc Contact: Matthew Werner

Votes: 0

Watchers: 12

Created: 2024/04/02 3:56 PM

Updated: 2024/12/06 9:33 PM

Resolved: 2024/07/08 10:51 PM
