-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.16, 4.17
-
Moderate
-
None
-
MCO Sprint 254, MCO Sprint 255
-
2
-
Rejected
-
False
-
Bug Fix
-
Done
Description of problem
CI is occasionally bumping into failures like:
: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 53m22s
{ fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
Ginkgo exit error 1: exit with code 1}
where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
2024-05-17T12:57:04Z RenderDegraded=False :
2024-05-17T12:58:35Z Degraded=False :
2024-05-17T12:58:35Z NodeDegraded=False :
2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
2024-05-17T15:13:22Z Updating=False :
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
2024-05-17T14:15:22Z
Because of changes to registry pull secrets:
$ dump() {
>   curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]'
> }
$ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
--- /dev/fd/63 2024-05-17 12:28:37.882351026 -0700
+++ /dev/fd/62 2024-05-17 12:28:37.883351026 -0700
@@ -1 +1 @@
-{"key":"172.30.124.169:5000",...
+{"key":"172.30.124.169:5000",...
@@ -3,3 +3,3 @@
-{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
-{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
-{"key":"image-registry.openshift-image-registry.svc:5000",...
+{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
+{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
+{"key":"image-registry.openshift-image-registry.svc:5000",...
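For anyone who wants to repeat the comparison above without the shell pipeline, here is a rough Python sketch of the same decode-and-diff step. It assumes, as the python3 one-liner above does, that contents.source is a percent-encoded data URL whose payload is JSON with a top-level "auths" object. The helper names (decode_auths, changed_registries, make_source) and the sample registry names and values are hypothetical and are not taken from the CI artifacts.

```python
import json
import urllib.parse


def decode_auths(source: str) -> dict:
    """Decode a 'contents.source' data URL into its .auths mapping.

    Mirrors the shell one-liner: percent-decode, then drop everything
    up to and including the first comma (the 'data:...,' prefix).
    """
    payload = urllib.parse.unquote(source).split(",", 1)[-1]
    return json.loads(payload)["auths"]


def changed_registries(old_source: str, new_source: str) -> list:
    """Return the registry keys whose entries differ between two sources."""
    old, new = decode_auths(old_source), decode_auths(new_source)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def make_source(auths: dict) -> str:
    """Build a sample data URL (hypothetical test data only)."""
    return "data:," + urllib.parse.quote(json.dumps({"auths": auths}))


# Hypothetical inputs: one entry rotates, one stays the same.
old_src = make_source({"registry.example:5000": {"auth": "b2xk"},
                       "other.example": {"auth": "c2FtZQ=="}})
new_src = make_source({"registry.example:5000": {"auth": "bmV3"},
                       "other.example": {"auth": "c2FtZQ=="}})

changed = changed_registries(old_src, new_src)
print(changed)
```

In the diff above, the keys are identical on both sides and only the elided values differ, which is the case this sketch reports as "changed".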
Version-Release number of selected component (if applicable)
Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.
How reproducible
Sippy reports Success Rate: 94.27% post regression, so a rare race.
But using CI search to pick jobs with 10 or more runs over the past 2 days:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, up at a 27% hit rate.
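As a sanity check on how to read the "impact" column, the sketch below assumes impact is roughly the share of failed runs multiplied by the share of those failures that match the search. That reading is inferred from the line format and is not confirmed against the CI search source. The function name approx_impact is hypothetical. Because the displayed percentages are already rounded, the product can be off by about a point from the reported figure.

```python
def approx_impact(failed_pct: float, match_pct: float) -> float:
    """Approximate impact in percent of all runs.

    Assumes impact ~= (percent of runs that failed) x (percent of
    failures matching the search), both taken from the rounded
    figures printed by CI search.
    """
    return failed_pct * match_pct / 100


# Figures copied from the CI search output above.
network_operator = approx_impact(36, 75)   # reported as 27% impact
azure_416 = approx_impact(50, 12)          # reported as 6% impact
azure_417_from_416 = approx_impact(53, 14) # reported as 8% impact

print(f"{network_operator:.1f} {azure_416:.1f} {azure_417_from_416:.1f}")
```

The third value comes out a little under the reported 8%, which is consistent with CI search computing from raw counts rather than from the rounded percentages.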
Steps to Reproduce
Unclear.
Actual results
Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.
Expected results
No MachineConfigPool roll after the ClusterVersion update completes.
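The expected result can be expressed as a check on the two timestamps gathered in the description: the master pool's Updated=True transition should not land after the ClusterVersion completionTime. The following is a minimal sketch using the values from the failing run. The helper names (parse_utc, pool_lag) are hypothetical, and the sketch assumes both timestamps use the second-resolution "Z" format shown above.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp like 2024-05-17T15:13:22Z as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pool_lag(pool_updated: str, cv_completed: str) -> timedelta:
    """How long after the ClusterVersion completed the pool reported Updated.

    A positive result means the pool was still rolling after the
    ClusterVersion update completed, which is the failure in this bug.
    """
    return parse_utc(pool_updated) - parse_utc(cv_completed)


# Values from the failing run in the description.
lag = pool_lag("2024-05-17T15:13:22Z", "2024-05-17T14:15:22Z")
print(lag)  # 0:58:00
```

For this run the master pool settled 58 minutes after the ClusterVersion update completed.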
Additional info
- blocks
-
OCPBUGS-36166 New registry pull secrets roll the control plane after 4.16 cluster updates
- Closed
- is cloned by
-
TRT-1683 New registry pull secrets roll the control plane after 4.16 cluster updates
- Closed
-
OCPBUGS-36166 New registry pull secrets roll the control plane after 4.16 cluster updates
- Closed
- relates to
-
OCPBUGS-33815 openshift-controller-manager overwriting/undoing changes to ServiceAccount imagePullSecrets
- Verified
-
OCPBUGS-33803 machine-os-puller SA refreshes every hour, causing machine config regeneration
- Closed
- links to
-
RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update