Description of problem:
The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.
Version-Release number of selected component (if applicable):
discovered in 4.10 -> 4.11 upgrade jobs
The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.
Steps to Reproduce:
Inspect the CVO log
and E2E logs from failing jobs with the provided check-cvo.py helper:
$ ./check-cvo.py cvo.log && echo PASS || echo FAIL
Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks: 5
Upgradeability check cache hits: 12
Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks: 12
Upgradeability check cache hits: 11
The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
I1227 06:50:59.023190 1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.
Some context in Slack thread