Loading...

Type: Spike
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
tech debt catchall/intake
Intelligence Requested:
Market:

Cost of Delay:
0
WSJF:
0

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

I spent a bit of shift week investigating this, the following were my findings. I'll keep working on this as I get time, but I wanted to put my current progress somewhere.
I'll admit some of this might be a little stream of consciousness-ish, I'll clean it up as I go along...I promise (:

This function is one of the main syncs in the operator. It has a 10 minute timeout minimum(maximum upto number of required nodes*10 minutes).
It is used to make sure that the required pools are in sync.

Current Loop structure:

sync metrics
propogate last error to cluster operator via status extension. This is done immediately, basically to report "we're in a transition or a bad state"
step through pools
- if any of the pools are degraded, call syncUpgradeableStatus(), return err via lastErr mechanism.
- if a required pool, we do the following
  --> osImageURL/MCP valid check(more details below). if this fails, call syncUpgradeableStatus(), return err via lastErr mechanism
  --> check status generation >= spec generation & if pool is updated. If this evaluatees to true, we mark the pool as updated and skip the iteration of the loop.
  if this fails, call syncUpgradeableStatus(), return err via lastErr mechanism.
  --> if pool is paused, break out of the timed loop and report error immediately
- if we get past this inner "pool" loop, the update is complete and we can exit the wait successfully
if we timed out in this loop, we aggregate the error and return it from the sync function.
if we get out of the timed loop with no error, we return nil, indicating success

osImageURL/MCP valid check:
-Make sure both pool.Status.Configuration.Name & pool.Spec.Configuration.Name are not empty
-Make sure the number of sources in status are not empty...len(pool.Status.Configuration.Source) != 0
-Put all the machine config sources and the final generated machine config itself in a slice
-Iterate through the slice, checking for
--verify that the generated machine config(pool.Status.Configuration.Name) in this slice was created by the controller version passed in to this function.
This version passed in is compared to an annotation on the generated machine config. If the MC is not generated, this check is skipped. If the generated machine
config does not have an annotation, we pop an error. If this version match fails, we pop an error. The version used here from is from the version package in our repo
("github.com/openshift/machine-config-operator/pkg/version")
--To be noted...the no annotation check does not apply to bootstrap fragments(wat?why?)
--This loop does not do any checks on user generated configs
-compare generated config(pool.Status.Configuration.Name)'s annotated osImageURL against the osURL(obtained from config map) passed into the function. The osImageURL check is only relevant
if the user hasn't overriden the osImageURL.
-compare the releaseVersion annotated from the generated machine config against the releaseVersion passed into the function(this is from the operator's version store). If this fails, we pop
an error; if the annotation fetch fails, we pop another error.

syncUpgradeableStatus():
This checks all the pool conditions and then marks the cluster operator degraded if any of the pools are degraded.

Plans:
-change the waitwithContextTimeout to a wait with cancel and a manual timer that we can manipulate based on number of nodes.
-move getOsImageURLs inside isMachineConfigPoolConfigurationValid
-cleaner calling of syncUpgradeableStatus()
-Make it more obvious when we are making progress(continue when a master pool succeeds is a little unclear)
-Redo comment inside osImageURL/MCP valid function(atleast the header, since its a little confusing)
-Update/More comments in the main function explaining why certain checks exist

More wild plans:
-breakout an update syncloop and a steady state sync loop for required pools???

relates to

OCPBUGS-4820 Controller version mismatch causing degradation during upgrades

Closed

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates