  1. Machine Config Operator
  2. MCO-2103

Image-Mode OpenShift Convergence


      Introduction

      This epic describes the ideal end state for Image-Mode OpenShift convergence and a path for getting there.

      Terminology

      Several terms are sometimes used interchangeably when this topic is discussed, both here and elsewhere. This section is an attempt to disambiguate them:

      • Non-Layered Update Path: The current default operational mode of the Machine Config Daemon (MCD), in which it writes files directly to each node’s live filesystem, updates the OS image, and makes any other changes the configuration requires. In this mode, the MCD behaves like a traditional configuration management system.
      • Image-Mode OpenShift: The use of bootc to apply new configurations and update the OS images on OpenShift cluster nodes in a safe and transactional manner. These OS images can come from the OpenShift release payload, from an off-cluster source, or could be built within the cluster.
      • On-Cluster Layering (OCL) or On-Cluster Builds (OCB): This refers to the idea that an OpenShift cluster should be able to produce its own OS image which contains the current MachineConfigs, extensions, and any admin-provided customizations. The default implementation of this is the Build Controller, described below.
      • Off-Cluster Layering or Off-Cluster Builds: This refers to the idea that the OS image for a given OpenShift instance is built outside of the cluster in a CI system and then the image is deployed to the cluster by setting the OSImageURL field of a MachineConfig.
      • Build Controller: The Build Controller came into existence as the default implementation of on-cluster layering, with the intent of avoiding external dependencies such as OpenShift Image Builder (now deprecated) or OpenShift Pipelines. It is a bespoke image build solution provided and maintained by the MCO team.
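
      To make the off-cluster layering entry above concrete, here is a minimal sketch of the kind of MachineConfig manifest that points a pool at a prebuilt OS image. The object name, the `worker` role, and the registry pullspec are illustrative placeholders, not values taken from this epic; only the `osImageURL` field itself comes from the text.

```python
import json


def os_image_machineconfig(name: str, role: str, image_pullspec: str) -> dict:
    """Return a MachineConfig manifest that sets the OS image for one pool."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            # This label ties the MachineConfig to a MachineConfigPool role.
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        # The image is assumed to have been built and pushed by an external CI system.
        "spec": {"osImageURL": image_pullspec},
    }


manifest = os_image_machineconfig(
    name="99-worker-custom-os",  # hypothetical name
    role="worker",
    image_pullspec="registry.example.com/os/custom:latest",  # placeholder pullspec
)
print(json.dumps(manifest, indent=2))
```

      In practice a digest-pinned pullspec would likely be preferable to a tag, so that every node in the pool converges on exactly the same image.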

      Desired state

      Or, “How will we know that we’ve achieved convergence?” In my opinion, convergence will be considered achieved when the following conditions are met:

      • From Day 0, on-cluster builds are enabled by default, either by making an image registry a hard requirement or by solving the internal image registry problem. Either is acceptable.
      • The non-layered update path will cease to exist except for a select set of configurations, such as SSH keys and certificates, which the MCD already handles today.
      • All MachineConfig changes, extensions, custom configurations and packages will become image layers and be applied to the host as a complete OS image.
      • rpm-ostree is no longer used within OpenShift, either on the nodes or in the on-cluster build process. bootc will be the only way to apply an OS image to a node.
      • There is a single way to specify which OS image should be installed on a node, such as OSImageURL on MachineConfigs (not sure how this dovetails with dualstream yet).
      • The build process and the image rollout process are decoupled from one another such that the on-cluster build system can effectively delegate the work to another process if the cluster admin wishes. Once the build is complete, the on-cluster build system is notified and begins installing the new OS image.
      • There is a centralized mechanism for fanning out prebuilt images to large HyperShift cluster tenants, large OpenShift clusters, and large numbers of OpenShift clusters of any size, including single-node (SNO) instances.
      • Operators are capable of supplying their own customized content independently from each other and independently from the cluster admins’ content.
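
      The decoupling of build and rollout described in the list above could look roughly like the following sketch. It is only an illustration of the shape of the idea: `ImageBuilder`, `LocalBuilder`, `DelegatedBuilder`, and `RolloutController` are hypothetical names invented here, not MCO components, and the pullspecs are placeholders.

```python
from typing import Callable, Dict, Protocol

# Callback signature: (pool name, pullspec of the finished image).
BuildDone = Callable[[str, str], None]


class ImageBuilder(Protocol):
    """Anything that can produce an OS image for a pool and report completion."""

    def build(self, pool: str, on_complete: BuildDone) -> None: ...


class LocalBuilder:
    """Stands in for a build that runs inside the cluster."""

    def build(self, pool: str, on_complete: BuildDone) -> None:
        on_complete(pool, f"registry.example.com/os/{pool}:built-locally")


class DelegatedBuilder:
    """Stands in for a build handed off to an external system."""

    def build(self, pool: str, on_complete: BuildDone) -> None:
        on_complete(pool, f"registry.example.com/ci/{pool}:built-externally")


class RolloutController:
    """Requests builds and starts a rollout only once notified of completion."""

    def __init__(self, builder: ImageBuilder) -> None:
        self.builder = builder
        self.installed: Dict[str, str] = {}

    def request_update(self, pool: str) -> None:
        # The controller does not care which builder does the work.
        self.builder.build(pool, self.on_build_complete)

    def on_build_complete(self, pool: str, image: str) -> None:
        # A real rollout would drain and reboot nodes; here we only record the image.
        self.installed[pool] = image


controller = RolloutController(DelegatedBuilder())
controller.request_update("worker")
print(controller.installed)
```

      Swapping `DelegatedBuilder` for `LocalBuilder` changes where the image is produced without touching the rollout logic, which is the property the condition above asks for.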

      How to move forward

      The end goal is that, by switching to Image-Mode OpenShift, we can remove the non-layered update path for all but a select few updates, such as SSH keys and certificates, to maintain parity with how OpenShift operates today. Getting there will not be easy because of the dependency graph:

      • We need to adopt bootc, but we can’t fully do that until on-cluster layering is enabled by default.
      • We can’t enable on-cluster layering by default until we improve the default build process.
      • We can’t improve the default build process until we allow cluster operators to use it and make it work across all cluster topologies.
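
      The dependency chain in the list above can be written down as a small graph and sorted to show the order in which the work has to land. This is only a sketch: the step labels paraphrase the bullets, and the use of Python's standard `graphlib` module is an illustrative choice, not something the MCO uses.

```python
from graphlib import TopologicalSorter

# Each key depends on every item in its value set (node -> prerequisites).
dependencies = {
    "fully adopt bootc": {"enable on-cluster layering by default"},
    "enable on-cluster layering by default": {"improve the default build process"},
    "improve the default build process": {
        "open the build process to cluster operators and all topologies"
    },
}

# static_order() yields prerequisites before the steps that depend on them.
work_order = list(TopologicalSorter(dependencies).static_order())

for position, step in enumerate(work_order, start=1):
    print(f"{position}. {step}")
```

      Because the chain is linear, the sorted order is simply the list read from bottom to top: the build process has to be opened up and improved before on-cluster layering can become the default, and only then can bootc be fully adopted.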

       

      This is the path forward that I can see:

      1. We need to do the bootc experiments in MCO-2071 to determine whether partial adoption is possible. The findings will inform how long we can keep the non-layered update path in existence. Regardless of the outcome, we will need to figure out how to put bootc behind a Feature Gate, and we will need to declare on-cluster layering a hard requirement for bootc adoption. This will inform the timeline for MCO-1358.
      2. We should be able to do the work needed to enable OCL by default (MCO-1358). However, we will then need to focus on improving the OS image build process (MCO-1545) to make it more robust and supportable. Bugs and corner cases we didn’t consider are inevitable, so we need to surface them in a more consistent and visible way.
      3. As we consider the OS image build process improvements, we should take a look at supporting HyperShift (MCO-1173), particularly consolidating the image rollout mechanism and working out how to allow OCL to communicate with a centralized image build system. We may also need to consider MCD changes to support that use case.

      While we’re improving the OS image build system, we may want to consider how cluster operators can make changes that do not conflict with user-defined configs. A potential solution can be found in RFE-7817, although the work will need to be broken down. Implementing this will be very complex unless we improve the build process (MCO-1545) first.

              zzlotnik@redhat.com Zack Zlotnik