Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-5505

Upgradeability check is throttled too much and with unnecessary non-determinism

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Normal Normal
    • None
    • 4.11.z
    • None
    • Moderate
    • None
    • 5
    • OTA 230
    • 1
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.

      Version-Release number of selected component (if applicable):

      discovered in 4.10 -> 4.11 upgrade jobs

      How reproducible:

      The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.

      Steps to Reproduce:

      Inspect the CVO log and E2E logs from failing jobs with the provided check-cvo.py helper:

      $ ./check-cvo.py cvo.log && echo PASS || echo FAIL
      

      Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)

      Actual results:

      $ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
      FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
      FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
      FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
      FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
      FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
      Upgradeability checks:           5
      Upgradeability check cache hits: 12
      FAIL
      

      Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.

      Expected result

      $ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
      Upgradeability checks:           12
      Upgradeability check cache hits: 11
      PASS
      

      The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.

      Additional info:

      $ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
      ...
      I1227 06:50:59.023190       1 upgradeable.go:122] Cluster current version=4.10.46
      I1227 06:50:59.042735       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
      I1227 06:51:14.024345       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
      I1227 06:53:23.080768       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
      I1227 06:56:59.366010       1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640
      
      $ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
      Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
      Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
      
      The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
      

      I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.

      Some context in Slack thread

            afri@afri.cz Petr Muller
            afri@afri.cz Petr Muller
            Evgeni Vakhonin Evgeni Vakhonin
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: