-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.16, 4.17
-
Moderate
-
No
-
MCO Sprint 255, MCO Sprint 256
-
2
-
Rejected
-
False
-
-
This now uses an internal mechanism to avoid rolling out repeated MachineConfig updates in response to changes to the internal registry pull secret.
-
Release Note Not Required
-
In Progress
This is a clone of issue OCPBUGS-33913. The following is the description of the original issue:
—
Description of problem
CI is occasionally bumping into failures like:
: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 53m22s
{ fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
Ginkgo exit error 1: exit with code 1}
where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
2024-05-17T12:57:04Z RenderDegraded=False :
2024-05-17T12:58:35Z Degraded=False :
2024-05-17T12:58:35Z NodeDegraded=False :
2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
2024-05-17T15:13:22Z Updating=False :
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
2024-05-17T14:15:22Z
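The comparison above (the master pool reaching Updated=True after the ClusterVersion completionTime) can be sketched in Python. This is a minimal illustration, not part of the test suite: the function names are hypothetical, and the input dictionary is a hand-trimmed copy of the conditions shown above rather than the full machineconfigpools.json.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2024-05-17T15:13:22Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pool_rolled_after_update(pool_status: dict, cv_completion: str):
    """Return how long after ClusterVersion completion the pool last
    transitioned to Updated=True, or None if it settled before completion."""
    for cond in pool_status.get("conditions", []):
        if cond["type"] == "Updated" and cond["status"] == "True":
            lag = parse_ts(cond["lastTransitionTime"]) - parse_ts(cv_completion)
            return lag if lag.total_seconds() > 0 else None
    return None


# Hand-trimmed from the jq output above (only the fields the check needs).
master_status = {
    "conditions": [
        {"type": "Updated", "status": "True",
         "lastTransitionTime": "2024-05-17T15:13:22Z"},
        {"type": "Updating", "status": "False",
         "lastTransitionTime": "2024-05-17T15:13:22Z"},
    ]
}

lag = pool_rolled_after_update(master_status, "2024-05-17T14:15:22Z")
print(lag)  # 0:58:00
```

For this run the pool finished rolling 58 minutes after the ClusterVersion update reported completion.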
Because of changes to registry pull secrets:
$ dump() {
> curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]'
> }
$ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
--- /dev/fd/63 2024-05-17 12:28:37.882351026 -0700
+++ /dev/fd/62 2024-05-17 12:28:37.883351026 -0700
@@ -1 +1 @@
-{"key":"172.30.124.169:5000",...
+{"key":"172.30.124.169:5000",...
@@ -3,3 +3,3 @@
-{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
-{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
-{"key":"image-registry.openshift-image-registry.svc:5000",...
+{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
+{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
+{"key":"image-registry.openshift-image-registry.svc:5000",...
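The same comparison can be done without a shell pipeline. The sketch below mirrors the python3 one-liner above (percent-decode the source, keep everything after the first comma) and reports which registry keys have differing values between two rendered configs. The helper names, registry hostnames, and auth values are hypothetical placeholders, and it assumes the file contents are a percent-encoded data URL as in the pipeline above.

```python
import json
import urllib.parse


def decode_auths(source: str) -> dict:
    """Decode a 'data:,<percent-encoded JSON>' source into its auths mapping."""
    payload = urllib.parse.unquote(source).split(",", 1)[-1]
    return json.loads(payload)["auths"]


def changed_registries(old_source: str, new_source: str) -> list:
    """Return registry keys whose entries were added, removed, or changed."""
    old, new = decode_auths(old_source), decode_auths(new_source)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def as_source(auths: dict) -> str:
    """Build a sample data URL (test helper only, not from the original report)."""
    return "data:," + urllib.parse.quote(json.dumps({"auths": auths}))


# Hypothetical sample data standing in for two rendered MachineConfigs.
old = as_source({
    "registry-a.example:5000": {"auth": "b2xk"},
    "registry-b.example:5000": {"auth": "c2FtZQ=="},
})
new = as_source({
    "registry-a.example:5000": {"auth": "bmV3"},
    "registry-b.example:5000": {"auth": "c2FtZQ=="},
})

print(changed_registries(old, new))  # ['registry-a.example:5000']
```

In the diff above, the registry keys are identical on both sides, so only the redacted values differ; that is the case this helper flags.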
Version-Release number of selected component (if applicable)
Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.
How reproducible
Sippy reports Success Rate: 94.27% post regression, so a rare race.
But using CI search to pick jobs with 10 or more runs over the past 2 days:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, up at a 27% hit rate.
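To rank the jobs by hit rate, the CI search summary lines can be parsed with a small script. The sketch below is an assumption about the line layout based only on the output above, not a documented format. The impact figure is roughly the failure rate times the matching share of failures; because the tool prints rounded percentages, multiplying the printed values can be off by about a point for some jobs.

```python
import re

# Pattern inferred from the summary lines above; not a documented format.
LINE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse(line: str):
    """Parse one summary line into a dict, or return None if it does not match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {"job": d["job"],
            **{k: int(d[k]) for k in ("runs", "failed", "match", "impact")}}


# Three lines copied from the output above.
lines = [
    "periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact",
    "periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact",
    "pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact",
]

jobs = sorted(filter(None, map(parse, lines)),
              key=lambda j: j["impact"], reverse=True)
for j in jobs:
    approx = j["failed"] * j["match"] / 100  # product of the rounded percentages
    print(f'{j["impact"]:3d}% reported, ~{approx:.1f}% estimated  {j["job"]}')
```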
Steps to Reproduce
Unclear.
Actual results
Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.
Expected results
No MachineConfigPool roll after the ClusterVersion update completes.
Additional info
- clones
-
OCPBUGS-33913 New registry pull secrets roll the control plane after 4.16 cluster updates
- Closed
- is blocked by
-
OCPBUGS-33913 New registry pull secrets roll the control plane after 4.16 cluster updates
- Closed
- links to
-
RHBA-2024:4316 OpenShift Container Platform 4.16.z bug fix update