-
Story
-
Resolution: Unresolved
-
Future Sustainability
This card extends CI coverage to the recently accepted rule:
Progressing indicates that the component (operator and all configured operands) is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise moving from one steady state to another.
Here, "version change" is used as an explicit example of Progressing.
DoD:
* A test exists in o/origin that covers the above rule.
* (After soaking for 1 week?) Open bugs for the COs that violate the rule and add them as exceptions in the test.
Context:
Currently, CVO depends on COs reporting Progressing=True to determine the duration of a CO update. Ideally, a CO goes Progressing=True immediately after CVO bumps the deployment of the operator that manages the CO. The longer the delay, the less precise the begin time of the CO update. For the begin time to be good enough, CVO would need to:
- start the clock when bumping the deployment and figure out which CO that deployment manages
- time out when the CO does not go Progressing=True (which would require a new SLO for COs)
In practice, we hope to use "Fail CI if there are COs that do not report Progressing=True during a cluster upgrade" to cover the above case without having to invest in pinning down a more specific start time outside of CVO. This should be good enough because:
- CO update duration in CI is short (as the whole cluster upgrade is quick), so seeing a CO go Progressing=True at all is close enough to meeting the SLO.
- CVO's current internal timer of "first time I hoped the CO would show up with the new version" is pretty solid, except for "co/machine-config" (as CVO is restarted while machine-config goes Progressing). However, we are currently happy with machine-config's Progressing.
—
Update:
It turned out that quite a few COs do not report Progressing=True during a cluster upgrade.
CVO (and probably our users as well) cares only about the case where a CO update takes a long time.
Do we want to file a bug if a CO whose update finishes really quickly does not go Progressing?
Do we want to change the rule to "CO must report Progressing=True if its version bump takes longer than n minutes"?