[OCPBUGS-9108] openshift-tests-upgrade.[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.13, 4.12, 4.11, 4.14, 4.15
Component/s: Machine Config Operator
Labels:

Severity: Moderate
Regression: None
Epic Link: Machine Config Node
Sprint: MCO Sprint 250, MCO Sprint 251
sprint_count: 2
Release Blocker: Rejected
Architecture: Unspecified
Release Note Text: N/A
Release Note Type: Release Note Not Required
Release Note Status: In Progress
Internal Whiteboard:
Target Version: 4.16.0
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Similar to bug 1955300, but seen in a recent 4.11-to-4.11 update [1]:

: [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed 47m16s
1 unexpected clusteroperator state transitions during e2e test run

Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]

Not something that breaks the update, but still something that sounds pretty alarming, and which had ClusterOperatorDown over its 10m threshold [2]. In this case, the alert did not fire because the metric-serving CVO moved from one instance to another as the MCO was rolling the nodes, and we haven't attempted anything like [3] yet to guard ClusterOperatorDown against that sort of label drift.

Over the past 24h, there were a bunch of hits across several versions, each lasting at least 1006s (just under 17 minutes):

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]'
Feb 05 22:19:52.700 - 2038s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.nightly-s390x-2022-02-05-125308
Feb 05 18:57:26.413 - 1470s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-143255}]
Feb 05 22:00:03.973 - 1265s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]
Feb 05 15:17:47.103 - 1154s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
Feb 05 07:55:59.474 - 1162s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-04-190512}]
Feb 05 12:15:30.132 - 1178s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-063300}]
Feb 05 19:48:07.442 - 1588s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]
Feb 05 23:30:46.180 - 5629s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
Feb 05 19:02:16.918 - 1622s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 22:05:50.214 - 1663s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 22:54:19.037 - 6791s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
Feb 05 09:47:44.404 - 1006s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 20:20:47.845 - 1627s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 06 03:40:24.441 - 1197s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 23:28:33.815 - 5264s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.9.17}]
Feb 05 06:20:32.073 - 1261s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]
Feb 05 09:25:36.180 - 1434s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]
Feb 05 12:20:24.804 - 1185s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-075430}]
Feb 05 21:47:40.665 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-141309}]
Feb 06 04:41:02.410 - 1187s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-203247}]
Feb 05 09:18:04.402 - 1321s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 05 12:31:23.489 - 1446s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 06 01:32:14.191 - 1011s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 06 04:57:35.973 - 1508s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 05 09:16:49.005 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 05 22:44:04.061 - 1231s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 05 09:30:33.921 - 1209s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
Feb 05 19:53:51.738 - 1054s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]
Feb 05 20:12:54.733 - 1152s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]
Feb 06 03:12:05.404 - 1024s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]
Feb 06 03:18:47.421 - 1052s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]
Feb 05 12:15:03.471 - 1386s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 17:21:15.357 - 1087s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
Feb 05 09:31:14.667 - 1632s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.okd-2022-02-05-081152}]
Feb 05 12:29:22.119 - 1060s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.okd-2022-02-05-101655
Feb 05 17:43:45.938 - 1380s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 06 02:35:34.300 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-011358-ci-op-xl025ywb-initial}]
Feb 06 06:15:23.991 - 1135s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-044734-ci-op-1xyd57n7-initial}]
Feb 05 09:25:22.083 - 1071s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-05-080202-ci-op-dl3w4ks4-initial}]
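
To compare durations like these against the 10m ClusterOperatorDown threshold cited in [2], a minimal Python sketch follows. It assumes the interval-line layout seen in the search output (timestamp, "- <seconds>s", level, locator); the regular expression and the names INTERVAL, THRESHOLD_SECONDS and unavailable_durations are illustrative and not part of any CI tooling.

```python
import re

# Assumed layout, taken from the search output:
# "Feb 05 22:19:52.700 - 2038s E clusteroperator/machine-config condition/Available status/False ..."
INTERVAL = re.compile(
    r"^(?P<start>\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<seconds>\d+)s E "
    r"clusteroperator/machine-config condition/Available status/False"
)

# The 10m ClusterOperatorDown threshold mentioned in the report (assumed constant name).
THRESHOLD_SECONDS = 10 * 60


def unavailable_durations(lines):
    """Return the duration in seconds of each Available=False interval line."""
    durations = []
    for line in lines:
        match = INTERVAL.match(line.strip())
        if match:
            durations.append(int(match.group("seconds")))
    return durations


if __name__ == "__main__":
    sample = [
        "Feb 05 22:19:52.700 - 2038s E clusteroperator/machine-config condition/Available "
        "status/False reason/Cluster not available for 4.8.0-0.nightly-s390x-2022-02-05-125308",
        "Feb 05 09:47:44.404 - 1006s E clusteroperator/machine-config condition/Available "
        "status/False reason/Cluster not available for 4.9.19",
    ]
    durations = unavailable_durations(sample)
    over = [d for d in durations if d > THRESHOLD_SECONDS]
    print(f"{len(over)} of {len(durations)} intervals exceed {THRESHOLD_SECONDS}s; "
          f"shortest is {min(durations)}s")
```

Feeding it the full search output (one interval per line) would show whether every hit clears the threshold, which is the case for the durations listed above.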

Breaking down by job name:

$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade (all) - 70 runs, 47% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 40 runs, 60% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 76 runs, 42% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 77 runs, 65% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 41 runs, 61% failed, 12% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 80 runs, 59% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 88 runs, 55% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 79 runs, 54% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 45 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 33 runs, 45% failed, 13% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 31 runs, 100% failed, 3% of failures match = 3% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Those impact percentages are just matches; this particular test-case is non-fatal.
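
As a sanity check on how those columns relate, the impact figure appears to be roughly the failure rate multiplied by the match rate. That relationship is an inference from the numbers above, not documented behavior of search.ci, and because each printed percentage is already rounded, the product can be off by a point (for example, 55% failed with 8% matching gives 4.4%, while the summary prints 5% impact). A minimal Python sketch, with hypothetical parse_summary and approximate_impact helpers:

```python
import re

# Assumed layout of one "failures match" summary line from the w3m dump.
SUMMARY = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_summary(line):
    """Split one summary line into its job name and integer fields, or return None."""
    m = SUMMARY.match(line.strip())
    if m is None:
        return None
    return {
        "job": m.group("job"),
        "runs": int(m.group("runs")),
        "failed_pct": int(m.group("failed")),
        "match_pct": int(m.group("match")),
        "impact_pct": int(m.group("impact")),
    }


def approximate_impact(failed_pct, match_pct):
    """Approximate impact as failure rate times match rate, both given in percent."""
    return failed_pct * match_pct / 100


if __name__ == "__main__":
    row = parse_summary(
        "periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - "
        "45 runs, 44% failed, 25% of failures match = 11% impact"
    )
    print(row["job"], approximate_impact(row["failed_pct"], row["match_pct"]), row["impact_pct"])
```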

The Available=False conditions also lack a 'reason', although they do contain a 'message', which is the same state we had back when I'd filed bug 1948088. Maybe we can pass through the Degraded reason around [4]?

Going back to the run in [1], the Degraded condition spent a few minutes at RenderConfigFailed, while [4] only has a carve-out for RequiredPools. The Degraded condition then went back to False, but for reasons I don't understand we remained Available=False until 22:33, when the MCO declared its portion of the update complete:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/machine-config '
Feb 05 22:15:40.029 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.029 - 147s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.430 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:18:07.150 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.150 - 898s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.178 W clusteroperator/machine-config condition/Degraded status/False changed:
Feb 05 22:18:21.505 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Feb 05 22:33:04.574 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:33:04.584 W clusteroperator/machine-config condition/Upgradeable status/True changed:
Feb 05 22:33:04.931 I clusteroperator/machine-config versions: operator 4.11.0-0.nightly-2022-02-05-152519 -> 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:33:05.531 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.11.0-0.nightly-2022-02-05-211325
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded
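
The 1044s and 147s interval figures can be cross-checked against the "changed:" transition timestamps in that output. A minimal Python sketch, assuming the timestamp layout shown above; the log lines carry no year, so 2022 is supplied from the release names, and parse_stamp and window_seconds are hypothetical helpers:

```python
from datetime import datetime


def parse_stamp(stamp, year=2022):
    """Parse a stamp such as 'Feb 05 22:15:40.430'; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def window_seconds(start, end):
    """Seconds elapsed between two log timestamps."""
    return (parse_stamp(end) - parse_stamp(start)).total_seconds()


if __name__ == "__main__":
    # Available went False at 22:15:40.430 and back to True at 22:33:04.574.
    print(window_seconds("Feb 05 22:15:40.430", "Feb 05 22:33:04.574"))
    # Degraded went True at 22:15:40.029 and back to False at 22:18:07.178.
    print(window_seconds("Feb 05 22:15:40.029", "Feb 05 22:18:07.178"))
```

The two windows come out at about 1044s and 147s, so Available stayed False for roughly 15 minutes after Degraded had already cleared.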

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088
[2]: https://github.com/openshift/cluster-version-operator/blob/06ec265e3a3bf47b599e56aec038022edbe8b5bb/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L87
[3]: https://github.com/openshift/cluster-version-operator/pull/643
[4]: https://github.com/openshift/machine-config-operator/blob/2add8f323f396a2063257fc283f8eed9038ea0cd/pkg/operator/status.go#L122-L126

relates to

MCO-452 [tech-preview] Proper state reporting when the MCO changes state (Closed)
OTA-362 CI: fail update suite if any ClusterOperator go Available=False (Closed)

links to

openshift/machine-config-operator#4240: OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update
