  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1316

[GA] Simplify and unify adding nodes to clusters on day 2

    • BU Product Work
    • False
    • False
    • 0% To Do, 0% In Progress, 100% Done
    • 0
    • Program Call
    • With any unification, enablement for CEE should be provided to help them change how they support customers.

      Feature Overview

      Adding nodes to on-prem OpenShift clusters is, in general, a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, some of them unsupported (see "Why is this important" below). Making cluster expansion easier will let users add nodes often and quickly, leading to a much improved UX.

      This feature adds nodes to any on-prem cluster, regardless of its installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that adds the node to the cluster specified by the user.

      Goals and requirements

      • Users can add a host to an OpenShift cluster on day 2 using a bootable image.
      • At least the baremetal, vSphere, none and Nutanix platforms are supported.
      • Clusters installed with any installation method can be expanded with the image
      • Clusters don't need to run any special agent to allow the new nodes to join.
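
      As a rough illustration of the platform requirement above, the sketch below checks whether a cluster's platform type is in the initially targeted set. It assumes the platform type is read from the Infrastructure resource named cluster (status.platformStatus.type) and that the values are spelled BareMetal, VSphere, None and Nutanix. The function name is_supported_platform is hypothetical and not part of this feature.

```shell
#!/bin/sh
# Return 0 if the given platform type is in the initially targeted set.
# The spellings are assumed to match the Infrastructure resource values.
is_supported_platform() {
    case "$1" in
        BareMetal|VSphere|None|Nutanix) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use against a live cluster (not run here, requires oc and a kubeconfig):
#   platform=$(oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}')
#   is_supported_platform "$platform" || echo "platform $platform is not targeted initially"
```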

      What this workflow could look like

      1. Create image:

      $ export KUBECONFIG=kubeconfig-of-target-cluster
      $ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker
      

      2. Boot image

      3. Check progress

      $ oc adm add-node 
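
      The three steps above could be wrapped in a small script. This is only a sketch: oc adm node-image and oc adm add-node are the commands proposed in this feature, not necessarily the final CLI, and the helpers node_image_cmd, run and add_node, the DRY_RUN variable and the file names are hypothetical. It also assumes file names without spaces. With DRY_RUN=1 (the default here) the script only prints what it would run.

```shell
#!/bin/sh
# Print the image-creation command proposed above for one node.
# $1: role (e.g. worker), $2: nmstate file with the node's network config
node_image_cmd() {
    printf 'oc adm node-image -o agent.iso --network-data=%s --role=%s\n' "$2" "$1"
}

# Run a command line, or only print it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $1"
    else
        sh -c "$1"
    fi
}

# Walk through the three steps for one node.
add_node() {
    run "$(node_image_cmd "$1" "$2")"                        # 1. create image
    echo "Boot the new host from agent.iso, then continue."  # 2. boot image (manual step)
    run "oc adm add-node"                                    # 3. check progress
}
```

      For example, with KUBECONFIG pointing at the target cluster, add_node worker worker-n.nmstate prints the two commands and the manual boot step in order.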

      Consolidate options

      An important goal of this feature is to unify and eliminate some of the existing options for adding nodes, aiming to provide a much simpler experience (see "Why is this important" below). We have official and field-documented ways to do this that could be removed once this feature is in place, simplifying the experience, our docs and the maintenance of those official paths:

      • UPI: Adding RHCOS worker nodes to a user-provisioned infrastructure cluster
        • This feature will remove the need to use this method for the majority of UPI clusters. The current UPI method consists of many manual steps. The new method would replace it with a couple of commands and would apply to probably more than 90% of UPI clusters.
      • Field-documented methods and asks
        • We are often asked about ways to do this, or shown the different ways in which the field is automating this process on its own. We can't control all aspects of these automations or how many there are. They are usually based on UPI, e.g.
        • gellner/expand-agent1.md (https://gist.github.com/gellner/f1f2928f847355ae80d0867884569109)
        • WKLD-433
      • IPI:
        • There are instances where adding a node to a bare metal IPI-deployed cluster can't be done via its BMC. This new feature, while not replacing the day-2 IPI workflow, solves the problem for this use case.
      • MCE: Scaling hosts to an infrastructure environment
        • This method is the most time-consuming and in many cases overkill, but currently, along with the UPI method, it is one of the two options we can give to users.
        • We shouldn't need to ask users to install and configure the MCE operator and its infrastructure for a single cluster, as that becomes a project even larger than the UPI method. MCE should be reserved for when there is more than one cluster to manage.

      With this proposed workflow we eliminate the need to use the UPI method in the vast majority of cases. We also eliminate the field-documented methods that keep appearing in multiple formats to solve this problem, as well as the need to recommend MCE to all on-prem users. Finally, we add a simpler option for IPI-deployed clusters.

      In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and the overall UX.

      This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.

      Why is this important

      This feature is essential for several reasons. Firstly, it enables easy day-2 installation without requiring additional technical knowledge from the user. This simplifies the process of scaling cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).

      Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.

      Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where users simply want to add nodes to their existing cluster and aren't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.

      Additionally, in the case of IPI-deployed bare metal clusters, this feature removes the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.

      Lastly, this problem is often brought up in the field, where Red Hatters working with customers have put different custom automations in place to solve it, adding to the inconsistency of the processes used to scale clusters.

      Oracle Cloud Infrastructure

      This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback with Red Hat and Oracle that makes solving the lack of cluster expansion a requirement.

      Existing work

      We already have the basic technologies to do this in the assisted-service and the agent-based installer, which already do this work for new clusters and whose foundations we expect to leverage for this feature.

      Day 2 node addition with agent image.

      Yet Another Day 2 Node Addition Commands Proposal

      Enable day2 add node using agent-install: AGENT-682

       

              racedoro@redhat.com Ramon Acedo
              racedoro@redhat.com Ramon Acedo
              Andrea Fasano
              Andrea Fasano
              Pedro Jose Amoedo Martinez
              Stephanie Stout
              Zane Bitter
              Ramon Acedo
              Eric Rich