OCPSTRAT-1542: Two Node OpenShift topologies for edge customers
Type: Feature
Priority: Major
Resolution: Won't Do
Strategic Portfolio Work
Progress: 0% To Do, 0% In Progress, 100% Done
Program Call
Feature Overview (aka. Goal Summary)
Customers requiring on-site computing to serve business applications (e.g., point of sale, security and control applications, AI inference) are asking for a 2-node HA solution for their environments. Given the large number of deployments, they need an automated mechanism to recover quorum, prioritizing service uptime for workloads over data loss. These customers want to avoid investing in a “witness” node, which increases their cost and carries its own identified trade-offs.
The trade-offs differ across use cases, which is the reason for defining "heuristic profiles".
Goals (aka. expected user outcomes)
In general, the customers requiring this capability are looking for a 2-node Kubernetes cluster capable of automatically promoting a surviving node to recover etcd quorum, even if some data and state are lost.
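As a rough illustration only (not the actual design), the sketch below shows how such promotion could work, assuming the etcd v3 Go client, placeholder endpoint addresses, and a hypothetical restart-etcd-single helper that restarts the local etcd with its --force-new-cluster flag; TLS setup is omitted:

```go
// Rough sketch: if the peer stays unreachable past a few checks, re-form a
// single-member etcd cluster on the surviving node. The endpoints and the
// restart-etcd-single helper are assumptions, not the actual implementation.
package main

import (
	"context"
	"log"
	"os/exec"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"node-0:2379"}, // local member (placeholder; TLS omitted)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	peer := "node-1:2379" // the other node (placeholder)
	failures := 0
	for range time.Tick(10 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := cli.Status(ctx, peer) // health probe against the peer endpoint
		cancel()
		if err == nil {
			failures = 0 // peer healthy; reset the counter
			continue
		}
		if failures++; failures < 3 {
			continue // tolerate transient blips before acting
		}
		// Peer considered lost: restart the local etcd with
		// --force-new-cluster (a real etcd flag) via a hypothetical helper,
		// so the surviving node regains quorum as a single-member cluster.
		if out, err := exec.Command("restart-etcd-single").CombinedOutput(); err != nil {
			log.Printf("quorum recovery failed: %v: %s", err, out)
		} else {
			log.Print("re-formed single-member etcd cluster")
		}
		return
	}
}
```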
The following constraints and trade-offs are considered for the design.
Minimum operating environment
- 2-node infrastructure in the same site
- Minimum 16 Cores per node
- Minimum 32GB RAM per node
- Minimum one 200GB SSD or M.2 storage device per node
- Minimum 2x 10GbE network ports
- Nodes with static network definitions
- Local storage only (e.g., OCP LSO, OCP LVM, hostPath)
- No external storage dependencies
Accepted Constraints
- The cluster will host stateless apps and autonomous stateful apps
- The autonomous stateful apps are capable of automatic recovery from data loss or split-brain scenarios
- The cluster will NOT attempt additional or special-case remediations for stateful apps beyond default Kubernetes behavior
Accepted Trade-offs
- The customer values uptime over avoiding data loss
- The customer is okay with losing the data and state of the past 15 minutes
- After recovery from a network split, all new etcd data written on one of the nodes will be lost
- The customer is aware and okay with certain data loss as long as the applications return to an operational state within a specific RTO.
- The RPO is >15 minutes
- Quorum recovery after a network split will cause two data-loss events
Requirements (aka. Acceptance Criteria):
Functional & Operational Requirements
- Predefined selectable heuristics to select which of the two etcd databases to keep (see Heuristic Profiles)
- In case of a single node outage (e.g., due to failed HW), the surviving node must be able to reboot and restart the workload
- Recovery without manual intervention (manual interaction is not feasible at the edge)
- Day 0 operations: can be installed using regular edge installation mechanisms, e.g., ZTP workflows
- Day 1 operations: a failed node can be easily replaced
Heuristic Profiles for designating the new “primary node” and deciding which etcd data to discard:
- Static designation: a configuration parameter designates the primary node (e.g., a label or annotation).
- Node uptime: after recovering from a network partition, the node with the highest uptime is designated as the primary node.
- Last save wins: after recovering from a network partition, the node with the newest entry / last update is designated as the primary node (see the sketch after this list).
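For example, the “last save wins” profile could be approximated by comparing the store revision each member reports once both endpoints are reachable again. A minimal sketch, assuming the etcd v3 Go client and placeholder endpoints (TLS omitted); this is illustrative, not the feature's actual implementation:

```go
// Rough sketch of the "last save wins" heuristic: after the partition heals,
// keep the data of the member whose store has the newest write (highest
// revision). Endpoints are placeholders; error handling is minimal.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// pickPrimary returns the endpoint whose etcd copy saw the last update.
func pickPrimary(cli *clientv3.Client, endpoints []string) (string, error) {
	best, bestRev := "", int64(-1)
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep) // per-endpoint status query
		cancel()
		if err != nil {
			return "", fmt.Errorf("status %s: %w", ep, err)
		}
		// Header.Revision grows with every write applied to this member.
		if st.Header.Revision > bestRev {
			best, bestRev = ep, st.Header.Revision
		}
	}
	return best, nil
}

func main() {
	endpoints := []string{"node-0:2379", "node-1:2379"} // placeholders; TLS omitted
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	primary, err := pickPrimary(cli, endpoints)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("designated primary:", primary)
}
```

The static-designation profile would skip the comparison entirely and simply return the endpoint of the node carrying the configured label or annotation.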
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it is completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, also provide the OCPSTRAT for the configuration to be supported in the future.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | no (this is a new 2-node topology) |
Connected / Restricted Network | both |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and ARM |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | n/a |
Other (please specify) | n/a |
Use Cases (Optional):
Refer to the PRD doc for details (Red Hat internal): https://docs.google.com/document/d/1eTAeRObascRXgGfaxzxvdQrGxDQpEYhxdMtl-FuANKU/edit?usp=sharing
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Out of Scope
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Background
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Interoperability Considerations
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Issue links:
- is cloned by: OCPSTRAT-1500 Support 2+1 node Openshift cluster with Local Arbiter (OLA) - Tech Preview (In Progress)
- is depended on by: OCPSTRAT-1311 Allow 2-node control planes in day 1 with Agent-Based Installer (New)