  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1543

Support 2+1 node OpenShift cluster with Remote Arbiter node (ORA)


    • Strategic Portfolio Work
    • False
    • False
    • OCPSTRAT-1542 Two Node OpenShift topologies for edge customers
    • 100% To Do, 0% In Progress, 0% Done
    • 0

      Feature Overview (aka. Goal Summary)  

      Edge customers who need on-site computing to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. They want only two nodes at the edge because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, inexpensive arbiter node that is remote/virtual to reduce on-site hardware cost.

      Goals (aka. expected user outcomes)

      Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workloads and the control plane, and a third, remote, small “arbiter” node which ensures quorum. See requirements for more details.

      Requirements (aka. Acceptance Criteria):

      1. Remote arbiter node - third node running in a remote location (datacenter, cloud). The maximum allowed/supported network latency needs to be clearly documented. Goal: up to 500ms ping time (RTT) / 250ms one-way. The remote arbiter should consume as few resources as possible, to allow for large scale (e.g. using Hosted Control Plane, containerized control plane).
      2. Arbiter node can be a containerized or virtual host
      3. OCP Virt is fully functional, incl. live migration of VMs (assuming an RWX CSI driver is available)
      4. Single Node outage is handled seamlessly
      5. In case the remote arbiter node is down/not reachable, a reboot/restart of the two remaining on-site nodes has to work, i.e. the two remaining nodes regain quorum and spin up the workload.
      6. In double failure mode, i.e. connection to remote arbiter AND one node lost, standard “loss of quorum” behaviour is expected (read-only etcd, no cluster state changes, workload keeps running as long as it is stable).
      7. Scale-out of the cluster by adding additional worker nodes should be possible.
      8. It should be possible to transition the cluster into a regular 3-node compact cluster, e.g. by adding a new node as a control plane node, then removing the arbiter (witness) node.
      9. Regular workload should not be scheduled to the remote arbiter node. Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (e.g. router, registry, console, monitoring) should also not be scheduled to the arbiter node.
      10. It must be possible to explicitly schedule additional workload to the arbiter node. That is important for third-party solutions (e.g. storage providers) which also have quorum-based mechanisms.
      11. The topology must integrate seamlessly into existing installation/update mechanisms, esp. zero-touch provisioning.
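
      The failure modes in requirements 4 to 6 follow from majority quorum in a three-member etcd cluster, and can be sketched as a small model. This is an illustrative sketch only, not part of the feature: the names ClusterState, has_quorum and within_latency_goal are hypothetical, and the latency check simply encodes the 500ms RTT goal from requirement 1.

```python
from dataclasses import dataclass

ETCD_MEMBERS = 3       # two on-site nodes plus the remote arbiter
RTT_GOAL_MS = 500.0    # latency goal taken from requirement 1


@dataclass(frozen=True)
class ClusterState:
    """Reachability of the three etcd voting members (hypothetical model)."""
    node_a_up: bool
    node_b_up: bool
    arbiter_reachable: bool


def quorum_size(members: int) -> int:
    """Majority needed by a Raft-based store such as etcd."""
    return members // 2 + 1


def has_quorum(state: ClusterState) -> bool:
    """True if a majority of the three voting members can still talk."""
    healthy = sum((state.node_a_up, state.node_b_up, state.arbiter_reachable))
    return healthy >= quorum_size(ETCD_MEMBERS)


def within_latency_goal(rtt_ms: float) -> bool:
    """Check a measured round-trip time against the stated RTT goal."""
    return rtt_ms <= RTT_GOAL_MS


if __name__ == "__main__":
    scenarios = {
        "all members healthy": ClusterState(True, True, True),
        "single on-site node down (req. 4)": ClusterState(True, False, True),
        "arbiter unreachable (req. 5)": ClusterState(True, True, False),
        "arbiter and one node lost (req. 6)": ClusterState(True, False, False),
    }
    for label, state in scenarios.items():
        status = "quorum held" if has_quorum(state) else "quorum lost"
        print(f"{label}: {status}")
```

      The model only counts reachable members. It says nothing about how long recovery takes or how etcd behaves while quorum is being regained.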

       

       

      Deployment considerations: List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both: self-managed
      Classic (standalone cluster): yes
      Hosted control planes: no
      Multi node, Compact (three node), or Single node (SNO), or all: Multi node and Compact (three node)
      Connected / Restricted Network: both
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and ARM
      Operator compatibility: full
      Backport needed (list applicable versions): no
      UI need (e.g. OpenShift Console, dynamic plugin, OCM): no
      Other (please specify): n/a

       

      Questions to Answer (Optional):

      1. How should the scheduling restrictions for the arbiter node be implemented? New node role “arbiter”?
      2. Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA?
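
      One possible answer to question 1 is a dedicated node role combined with a taint, so that only pods carrying a matching toleration land on the arbiter. The sketch below models that idea in plain Python with no cluster access. The taint key node-role.kubernetes.io/arbiter is an assumption made for illustration, not a decision recorded in this ticket, and the matching is simplified to the taint key only (real Kubernetes tolerations also match on operator, value and effect).

```python
from dataclasses import dataclass, field

# Hypothetical taint key for an "arbiter" node role (assumption, see lead-in).
ARBITER_TAINT_KEY = "node-role.kubernetes.io/arbiter"


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    taints: list = field(default_factory=list)


@dataclass
class Pod:
    name: str
    tolerated_keys: set = field(default_factory=set)


def can_schedule(pod: Pod, node: Node) -> bool:
    """A pod fits only if it tolerates every NoSchedule taint on the node."""
    return all(
        taint.key in pod.tolerated_keys
        for taint in node.taints
        if taint.effect == "NoSchedule"
    )


arbiter = Node("arbiter-0", taints=[Taint(ARBITER_TAINT_KEY)])
primary = Node("primary-0")

etcd = Pod("etcd", tolerated_keys={ARBITER_TAINT_KEY})              # essential control plane
router = Pod("router")                                              # non-essential, no toleration
storage_quorum = Pod("storage-quorum", tolerated_keys={ARBITER_TAINT_KEY})  # explicit opt-in

if __name__ == "__main__":
    for pod in (etcd, router, storage_quorum):
        for node in (primary, arbiter):
            verdict = "allowed" if can_schedule(pod, node) else "rejected"
            print(f"{pod.name} on {node.name}: {verdict}")
```

      With this approach, requirements 9 and 10 reduce to which workloads are given the toleration: etcd components and explicitly opted-in third-party pods carry it, everything else does not.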

      Out of Scope

      1. Storage driver providing RWX shared storage

       

      Background

      Provide any additional context needed to frame the feature. Initial completion during Refinement status.

      • Two node support is in high demand by telco, industrial and retail customers.
      • VMware supports a two-node vSAN solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
      • Example hardware frequently used for edge deployments with a co-located small arbiter node: the Dell PowerEdge XR4000z server, an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads.

       

      Customer Considerations

      See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.

       

      Documentation Considerations

      1. The topology needs to be documented, esp. the requirements of the arbiter node.

       

      Interoperability Considerations

      1. OCP Virt needs to be explicitly tested in this scenario to support VM HA (live migration, restart on other node)

       

              dfroehli42rh Daniel Fröhlich
              wcabanba@redhat.com William Caban
              Daniel Fröhlich
              Matthew Werner Matthew Werner