  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1693

Support 2+1 node OpenShift cluster with Local Arbiter (OLA) - GA


    • Strategic Product Work
    • OCPSTRAT-1542 Two Node OpenShift topologies for edge customers
    • 100% To Do, 0% In Progress, 0% Done
    • XL
    • 0

      Feature Overview (aka. Goal Summary)  

      Edge customers who need on-site computing to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. They want only two nodes at the edge because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, inexpensive arbiter node that can optionally be remote/virtual to reduce on-site hardware cost.

      Goals (aka. expected user outcomes)

      Support OpenShift on a 2+1 topology: two primary nodes with large capacity that run the workload and control plane, plus a third, small “arbiter” node that ensures quorum. See the requirements for more details.

      Requirements (aka. Acceptance Criteria):

      1. Co-located arbiter node: a third node in the same network/location with low-latency network access, but much smaller than the two main nodes. Target resource requirements for the arbiter node: 4 cores / 8 vCPUs, 16 GB RAM, 120 GB disk (non-spinning), 1x 1 GbE network port, no BMC.
      2. OCP Virt is fully functional, including live migration of VMs (assuming an RWX CSI driver is available).
      3. A single-node outage is handled seamlessly.
      4. If the arbiter node is down, a reboot/restart of the two remaining nodes has to work, i.e. the two remaining nodes regain quorum and spin up the workload.
      5. Scaling out the cluster by adding additional worker nodes should be possible.
      6. Transitioning the cluster into a regular 3-node compact cluster, e.g. by adding a new node as a control plane node and then removing the arbiter node, should be possible.
      7. Regular workload should not be scheduled to the arbiter node (e.g. by making it unschedulable, or by introducing a new node role “arbiter”). Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (i.e. router, registry, console, monitoring etc.) should also not be scheduled to the arbiter node.
      8. It must be possible to explicitly schedule additional workload to the arbiter node. That is important for third-party solutions (e.g. storage providers) which also have quorum-based mechanisms.
      9. Must seamlessly integrate into existing installation/update mechanisms, esp. zero touch provisioning etc.
      10. Added: ability to track OLA usage in the fleet of connected clusters via OCP telemetry data
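
      Requirements 3, 4 and 6 all rest on majority quorum in etcd. The following is a minimal sketch of that arithmetic, assuming every control plane node (including the arbiter) runs one voting etcd member. The function names are illustrative only and are not part of any OpenShift or etcd API.

```python
def quorum_size(members: int) -> int:
    """Majority needed by a Raft-based store such as etcd: floor(n/2) + 1."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority of all members."""
    return healthy >= quorum_size(members)


# 2+1 topology: two primary nodes plus the arbiter = 3 voting members.
MEMBERS = 3

# Requirement 3: any single node may fail and the cluster keeps quorum.
print(has_quorum(MEMBERS, healthy=2))  # True

# Requirement 4: arbiter down, both primaries restart and rejoin.
# Two of three members are healthy again, so quorum is regained.
print(has_quorum(MEMBERS, healthy=2))  # True

# A plain two-member cluster cannot survive the loss of one member,
# which is why the arbiter is needed at all.
print(has_quorum(2, healthy=1))  # False

# Requirement 6: adding the new control plane node first gives 4 members
# (quorum 3); removing the arbiter afterwards returns to 3 (quorum 2).
print(quorum_size(4), quorum_size(3))  # 3 2
```

      The sketch only covers the voting arithmetic; it says nothing about how long re-election or workload restart takes after an outage.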

       

       

      Deployment considerations: List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both: self-managed
      Classic (standalone cluster): yes
      Hosted control planes: no
      Multi node, Compact (three node), or Single node (SNO), or all: Multi node and Compact (three node)
      Connected / Restricted Network: both
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and ARM
      Operator compatibility: full
      Backport needed (list applicable versions): no
      UI need (e.g. OpenShift Console, dynamic plugin, OCM): no
      Other (please specify): n/a

       

      Questions to Answer (Optional):

      1. How to implement the scheduling restrictions to the arbiter node? New node role “arbiter”?
      2. Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA?
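
      One possible answer to question 1 is the standard Kubernetes taint/toleration mechanism: taint the arbiter node so that regular pods are rejected, and let selected workloads (requirement 8) opt in with a matching toleration. The sketch below is a simplified model of that matching logic, not the scheduler's actual code. The taint key `node-role.kubernetes.io/arbiter` is an assumption made for illustration; this ticket does not decide on a key or on this approach.

```python
from dataclasses import dataclass

# Assumed taint key for illustration; not specified in this ticket.
ARBITER_TAINT_KEY = "node-role.kubernetes.io/arbiter"


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""        # empty key with operator "Exists" matches any key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""     # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A pod fits if every hard taint on the node is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


arbiter_taints = [Taint(key=ARBITER_TAINT_KEY, effect="NoSchedule")]

# Regular workload carries no toleration and is kept off the arbiter.
regular_pod: list[Toleration] = []

# A hypothetical third-party storage quorum pod opts in explicitly.
storage_pod = [
    Toleration(key=ARBITER_TAINT_KEY, operator="Exists", effect="NoSchedule")
]

print(schedulable(regular_pod, arbiter_taints))  # False
print(schedulable(storage_pod, arbiter_taints))  # True
```

      A taint alone keeps unprepared pods away but does not stop platform components that tolerate all taints, so a dedicated node role may still be needed on top of it.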

      Out of Scope

      1. Storage driver providing RWX shared storage

       

      Background

      Provide any additional context needed to frame the feature. Initial completion during Refinement status.

      • Two node support is in high demand by telco, industrial and retail customers.
      • VMware supports a two-node vSAN solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
      • Example hardware frequently used for edge deployments with a co-located small arbiter node: the Dell PowerEdge XR4000z Server, an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads.

       

      Customer Considerations

      See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.

       

      Documentation Considerations

      1. The topology needs to be documented, esp. the requirements of the arbiter node.

       

      Interoperability Considerations

      1. OCP Virt needs to be explicitly tested on this scenario to support VM HA (live migration, restart on other node)

       

              dfroehli42rh Daniel Fröhlich
              wcabanba@redhat.com William Caban
              Daniel Fröhlich, Nick Carboni, Oved Ourfali, Thomas Jungblut
              Chad Scribner
              Matthew Werner
              Jeremy Peterson
              Egli Hila
              Daniel Fröhlich
              John Long