MCO-985: Revisit Config Drift in On Cluster Build context

    • Type: Spike
    • Priority: Normal
    • Resolution: Unresolved
      Problem Statement:

      Does Config Drift Monitor still make sense in a post image-mode RHEL and on-cluster layering world? If it still makes sense, how and where does it fit in?

      Background:

      A common pattern with general-purpose config management systems such as Ansible or Chef is that configurations are written and rewritten to a node fleet on a periodic basis. To wit: I helped manage a fleet of machines using Chef before I joined Red Hat. Each of our nodes would run Chef and reapply all of its configurations once per hour. Even then, config drift could still occur, for example when someone hand-edited a file while debugging.

      By comparison, the MCO does not do this. The reason is that the MCO is a special-purpose config management system whose primary objective is managing nodes that run an immutable(-ish) OS. Instead, it waits for a new MachineConfig to be ready, then drains the node, writes the files, and in most cases reboots the node. However, like Ansible and Chef, the MCO only cares about a subset of files on the machine: the ones that it manages or is otherwise aware of. Consequently, undesired behaviors can occur when a cluster admin manually writes contents to a file that the MCO manages, even though doing so may be desirable in situations where, say, the admin needs to debug something.

      What problem does Config Drift Monitor solve?

      In the past, the MCO would perform a preflight check before rolling out a new config by comparing the on-disk contents of the files it manages to the contents specified in the old config. If the actual contents differed from the expected contents, the node would be marked degraded and the config rollout would halt. The problem is that a cluster admin might manually make a config change and forget to revert it. Several weeks or months pass. The cluster admin then attempts to apply a new MachineConfig or upgrade their cluster, only to discover that the process is blocked because one of the nodes is degraded.
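
      To make that old preflight behavior concrete, here is a minimal Go sketch of that kind of check. It is illustrative only, not the MCO's actual code; the validateOnDisk helper and the expected map are hypothetical stand-ins for the file contents specified by the currently applied config.

```go
// Minimal sketch of the old preflight drift check. Illustrative only;
// not the MCO's actual code.
package main

import (
	"bytes"
	"fmt"
	"os"
)

// validateOnDisk (hypothetical name) compares the on-disk contents of each
// managed file against the contents the current config says it should have.
// In the MCO, a mismatch here would degrade the node and halt the rollout.
func validateOnDisk(expected map[string][]byte) error {
	for path, want := range expected {
		got, err := os.ReadFile(path)
		if err != nil {
			return fmt.Errorf("reading managed file %s: %w", path, err)
		}
		if !bytes.Equal(got, want) {
			return fmt.Errorf("unexpected contents in %s", path)
		}
	}
	return nil
}

func main() {
	// Hypothetical managed file and contents.
	expected := map[string][]byte{
		"/etc/example-managed.conf": []byte("# managed by the MCO (example)\n"),
	}
	if err := validateOnDisk(expected); err != nil {
		fmt.Println("preflight check failed:", err)
	}
}
```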

      Enter Config Drift Monitor. By using fsnotify to watch the files that the MCO manages for write events, Config Drift Monitor can detect deviations from the expected content within seconds, degrade the node, and notify the cluster admin. The node stays degraded until the cluster admin either reverts the contents to what the MCO expects or tells the MCO to forcefully overwrite them.
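
      For illustration, here is a minimal Go sketch of the event-driven approach described above: an fsnotify watcher on a managed file that compares on-disk contents against the expected contents whenever a write occurs. The file path and contents are hypothetical, and the real Config Drift Monitor handles more cases (for example, files replaced via atomic rename) and integrates with the MCO's degradation reporting; this only shows the core idea.

```go
// Minimal sketch of an fsnotify-based drift watcher. Illustrative only;
// not the MCO's actual Config Drift Monitor.
package main

import (
	"bytes"
	"log"
	"os"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Hypothetical managed file -> expected contents. In the MCO this would
	// come from the rendered MachineConfig currently applied to the node.
	expected := map[string][]byte{
		"/etc/example-managed.conf": []byte("# managed by the MCO (example)\n"),
	}

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	for path := range expected {
		if err := watcher.Add(path); err != nil {
			log.Fatalf("cannot watch %s: %v", path, err)
		}
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			// Only writes (and creates, for files rewritten in place) matter
			// for drift detection.
			if event.Op&(fsnotify.Write|fsnotify.Create) == 0 {
				continue
			}
			actual, err := os.ReadFile(event.Name)
			if err != nil {
				log.Printf("read %s: %v", event.Name, err)
				continue
			}
			if !bytes.Equal(actual, expected[event.Name]) {
				// This is the point at which the MCO would degrade the node
				// and surface the drift to the cluster admin.
				log.Printf("config drift detected in %s", event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("watch error: %v", err)
		}
	}
}
```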

      All of this makes sense in a world where there is a rather finite set of files to manage and we know what their contents are supposed to be. Image-mode RHEL and on-cluster layering can fundamentally turn this idea on its head, since the unit of delivery becomes a full OS image rather than a known set of individually managed files. Consequently, does Config Drift Monitor still make sense in this world?

      Done When:

      • We have an idea of where Config Drift Monitor fits into the picture.
      • Appropriate Jira cards are opened to track any work needed to make Config Drift Monitor better align with this new paradigm.

      Assignee: Unassigned
      Reporter: Michelle Krejci (mkrejci-1)