Uploaded image for project: 'Machine Config Operator'
  1. Machine Config Operator
  2. MCO-653

Investigate refactoring syncRequiredMachineConfigPools

    XMLWordPrintable

Details

    • Spike
    • Resolution: Unresolved
    • Normal
    • None
    • None
    • None
    • 0
    • 0

    Description

      I spent a bit of shift week investigating this, the following were my findings. I'll keep working on this as I get time, but I wanted to put my current progress somewhere.  
      I'll admit some of this might be a little stream of consciousness-ish, I'll clean it up as I go along...I promise (: 

       
      This function is one of the main syncs in the operator. It has a 10 minute timeout minimum(maximum upto number of required nodes*10 minutes).
      It is used to make sure that the required pools are in sync.

      Current Loop structure:

      • sync metrics
      • propogate last error to cluster operator via status extension. This is done immediately, basically to report "we're in a transition or a bad state"
      • step through pools
        • if any of the pools are degraded, call syncUpgradeableStatus(), return err via lastErr mechanism.
        • if a required pool, we do the following
          --> osImageURL/MCP valid check(more details below). if this fails, call syncUpgradeableStatus(), return err via lastErr mechanism
          --> check status generation >= spec generation & if pool is updated. If this evaluatees to true, we mark the pool as updated and skip the iteration of the loop.
          if this fails, call syncUpgradeableStatus(), return err via lastErr mechanism.
          --> if pool is paused, break out of the timed loop and report error immediately
        • if we get past this inner "pool" loop, the update is complete and we can exit the wait successfully
      • if we timed out in this loop, we aggregate the error and return it from the sync function.
      • if we get out of the timed loop with no error, we return nil, indicating success

      osImageURL/MCP valid check:
      -Make sure both pool.Status.Configuration.Name & pool.Spec.Configuration.Name are not empty
      -Make sure the number of sources in status are not empty...len(pool.Status.Configuration.Source) != 0
      -Put all the machine config sources and the final generated machine config itself in a slice
      -Iterate through the slice, checking for
      --verify that the generated machine config(pool.Status.Configuration.Name) in this slice was created by the controller version passed in to this function.
      This version passed in is compared to an annotation on the generated machine config. If the MC is not generated, this check is skipped. If the generated machine
      config does not have an annotation, we pop an error. If this version match fails, we pop an error. The version used here from is from the version package in our repo
      ("github.com/openshift/machine-config-operator/pkg/version")
      --To be noted...the no annotation check does not apply to bootstrap fragments(wat?why?)
      --This loop does not do any checks on user generated configs
      -compare generated config(pool.Status.Configuration.Name)'s annotated osImageURL against the osURL(obtained from config map) passed into the function. The osImageURL check is only relevant
      if the user hasn't overriden the osImageURL.
      -compare the releaseVersion annotated from the generated machine config against the releaseVersion passed into the function(this is from the operator's version store). If this fails, we pop
      an error; if the annotation fetch fails, we pop another error.

      syncUpgradeableStatus():
      This checks all the pool conditions and then marks the cluster operator degraded if any of the pools are degraded.

      Plans:
      -change the waitwithContextTimeout to a wait with cancel and a manual timer that we can manipulate based on number of nodes.
      -move getOsImageURLs inside isMachineConfigPoolConfigurationValid
      -cleaner calling of syncUpgradeableStatus()
      -Make it more obvious when we are making progress(continue when a master pool succeeds is a little unclear)
      -Redo comment inside osImageURL/MCP valid function(atleast the header, since its a little confusing)
      -Update/More comments in the main function explaining why certain checks exist

      More wild plans:
      -breakout an update syncloop and a steady state sync loop for required pools???
       

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              djoshy David Joshy
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: