OpenShift Bugs / OCPBUGS-9108

openshift-tests-upgrade.[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available

    • Moderate
    • None
    • MCO Sprint 250, MCO Sprint 251
    • 2
    • Rejected
    • Unspecified
    • N/A
    • Release Note Not Required
    • In Progress

      Similar to bug 1955300, but seen in a recent 4.11-to-4.11 update [1]:

      [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
      Run #0: Failed 47m16s
      1 unexpected clusteroperator state transitions during e2e test run

      Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]

      Not something that breaks the update, but it still sounds pretty alarming, and the Available=False window was well past ClusterOperatorDown's 10m threshold [2]. In this case the alert did not fire, because the metric-serving CVO moved from one instance to another while the MCO was rolling the nodes, and we have not yet attempted anything like [3] to guard ClusterOperatorDown against that sort of label drift.
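
      As a rough spot-check of the alert's underlying signal, the same condition can be queried against a live cluster's Prometheus. This is a minimal sketch, assuming the prometheus-k8s route in openshift-monitoring, a logged-in user with monitoring access, and that ClusterOperatorDown is still built on the cluster_operator_up metric:

      $ TOKEN="$(oc whoami -t)"
      $ HOST="$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')"
      $ curl -sk -H "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=cluster_operator_up{name="machine-config"} == 0' "https://${HOST}/api/v1/query"

      Aggregating away the per-pod labels in the alert expression (something like max by (name) (...)) is roughly the sort of guard [3] is after, so the 10m timer would not reset when the serving CVO moves between instances.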

      Over the past 24h there seem to be a bunch of hits across several versions, all lasting 17+ minutes:

      $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]'
      Feb 05 22:19:52.700 - 2038s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.nightly-s390x-2022-02-05-125308
      Feb 05 18:57:26.413 - 1470s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-143255}]
      Feb 05 22:00:03.973 - 1265s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]
      Feb 05 15:17:47.103 - 1154s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
      Feb 05 07:55:59.474 - 1162s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-04-190512}]
      Feb 05 12:15:30.132 - 1178s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-063300}]
      Feb 05 19:48:07.442 - 1588s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]
      Feb 05 23:30:46.180 - 5629s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
      Feb 05 19:02:16.918 - 1622s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
      Feb 05 22:05:50.214 - 1663s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
      Feb 05 22:54:19.037 - 6791s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
      Feb 05 09:47:44.404 - 1006s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
      Feb 05 20:20:47.845 - 1627s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
      Feb 06 03:40:24.441 - 1197s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
      Feb 05 23:28:33.815 - 5264s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.9.17}]
      Feb 05 06:20:32.073 - 1261s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]
      Feb 05 09:25:36.180 - 1434s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]
      Feb 05 12:20:24.804 - 1185s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-075430}]
      Feb 05 21:47:40.665 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-141309}]
      Feb 06 04:41:02.410 - 1187s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-203247}]
      Feb 05 09:18:04.402 - 1321s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
      Feb 05 12:31:23.489 - 1446s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
      Feb 06 01:32:14.191 - 1011s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
      Feb 06 04:57:35.973 - 1508s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
      Feb 05 09:16:49.005 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
      Feb 05 22:44:04.061 - 1231s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
      Feb 05 09:30:33.921 - 1209s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
      Feb 05 19:53:51.738 - 1054s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]
      Feb 05 20:12:54.733 - 1152s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]
      Feb 06 03:12:05.404 - 1024s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]
      Feb 06 03:18:47.421 - 1052s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]
      Feb 05 12:15:03.471 - 1386s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
      Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
      Feb 05 17:21:15.357 - 1087s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
      Feb 05 09:31:14.667 - 1632s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.okd-2022-02-05-081152}]
      Feb 05 12:29:22.119 - 1060s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.okd-2022-02-05-101655
      Feb 05 17:43:45.938 - 1380s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
      Feb 06 02:35:34.300 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-011358-ci-op-xl025ywb-initial}]
      Feb 06 06:15:23.991 - 1135s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-044734-ci-op-1xyd57n7-initial}]
      Feb 05 09:25:22.083 - 1071s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-05-080202-ci-op-dl3w4ks4-initial}]

      Breaking down by job name:

      $ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | grep 'failures match' | sort
      periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade (all) - 70 runs, 47% failed, 6% of failures match = 3% impact
      periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 40 runs, 60% failed, 4% of failures match = 3% impact
      periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 76 runs, 42% failed, 9% of failures match = 4% impact
      periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 77 runs, 65% failed, 4% of failures match = 3% impact
      periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 41 runs, 61% failed, 12% of failures match = 7% impact
      periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
      periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 80 runs, 59% failed, 4% of failures match = 3% impact
      periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 7% of failures match = 4% impact
      periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 88 runs, 55% failed, 8% of failures match = 5% impact
      periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 79 runs, 54% failed, 2% of failures match = 1% impact
      periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 45 runs, 44% failed, 25% of failures match = 11% impact
      periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 33 runs, 45% failed, 13% of failures match = 6% impact
      periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
      periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
      pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
      pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
      pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 31 runs, 100% failed, 3% of failures match = 3% impact
      release-openshift-okd-installer-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
      release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

      Those impact percentages are just matches; this particular test-case is non-fatal.

      The Available=False conditions also lack a 'reason', although they do contain a 'message'; that is the same state we were in back when I filed bug 1948088. Maybe we can pass the Degraded reason through around [4]?
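
      As a quick way to see what consumers get today, the condition can be dumped straight off the ClusterOperator; this is just an illustrative jsonpath, not tied to any particular fix:

      $ oc get clusteroperator machine-config -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'

      During one of these windows that would print False with an empty reason column, matching the missing 'reason' described above.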

      Going back to the run in [1], the Degraded condition spent a few minutes at RenderConfigFailed, while [4] only has a carve-out for RequiredPools. The Degraded condition then went back to False, but for reasons I don't understand we remained Available=False until 22:33, when the MCO declared its portion of the update complete:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/machine-config '
      Feb 05 22:15:40.029 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
      Feb 05 22:15:40.029 - 147s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
      Feb 05 22:15:40.430 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
      Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
      Feb 05 22:18:07.150 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.11.0-0.nightly-2022-02-05-211325
      Feb 05 22:18:07.150 - 898s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.11.0-0.nightly-2022-02-05-211325
      Feb 05 22:18:07.178 W clusteroperator/machine-config condition/Degraded status/False changed:
      Feb 05 22:18:21.505 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
      Feb 05 22:33:04.574 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [{operator 4.11.0-0.nightly-2022-02-05-152519}]
      Feb 05 22:33:04.584 W clusteroperator/machine-config condition/Upgradeable status/True changed:
      Feb 05 22:33:04.931 I clusteroperator/machine-config versions: operator 4.11.0-0.nightly-2022-02-05-152519 -> 4.11.0-0.nightly-2022-02-05-211325
      Feb 05 22:33:05.531 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.11.0-0.nightly-2022-02-05-211325
      [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
      [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded

      [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088
      [2]: https://github.com/openshift/cluster-version-operator/blob/06ec265e3a3bf47b599e56aec038022edbe8b5bb/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L87
      [3]: https://github.com/openshift/cluster-version-operator/pull/643
      [4]: https://github.com/openshift/machine-config-operator/blob/2add8f323f396a2063257fc283f8eed9038ea0cd/pkg/operator/status.go#L122-L126

            djoshy David Joshy
            trking W. Trevor King
            Sergio Regidor de la Rosa