Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.16, 4.17
Component/s: Machine Config Operator
Labels:
- mco-triaged

Severity:
Moderate
Regression:
None
Sprint:
MCO Sprint 254, MCO Sprint 255
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

Release Note Text:

Previously, the `/etc/mco/internal-registry-pull-secret.json` secret was managed by the Machine Config Operator (MCO). Due to a recent change, this secret is rotated on an hourly basis. Whenever the MCO detected a change to this secret, it rolled the secret out to each node in the cluster, which resulted in disruptions. With this fix, a different internal mechanism processes changes to the internal registry pull secret to avoid rolling out repeated MachineConfig updates.
(link:https://issues.redhat.com/browse/OCPBUGS-33913[*OCPBUGS-33913*])

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.17.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem

CI is occasionally bumping into failures like:

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 53m22s
{  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
Ginkgo exit error 1: exit with code 1}

where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
2024-05-17T12:57:04Z RenderDegraded=False : 
2024-05-17T12:58:35Z Degraded=False : 
2024-05-17T12:58:35Z NodeDegraded=False : 
2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
2024-05-17T15:13:22Z Updating=False : 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
2024-05-17T14:15:22Z
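
The timestamp comparison above can be sketched in Python, assuming the two artifacts have already been downloaded and parsed as JSON. The function name `pool_updated_after_cv` and the trimmed sample data are hypothetical; the field names follow the jq queries above, and only the first ClusterVersion item is inspected.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a timestamp in the form used above, e.g. 2024-05-17T15:13:22Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pool_updated_after_cv(pools: dict, cluster_versions: dict, pool_name: str = "master"):
    """Return how long after ClusterVersion completion the pool last became Updated=True.

    A positive timedelta means the pool was still rolling after the update
    was reported complete. Returns None if the pool or condition is missing.
    """
    history = cluster_versions["items"][0]["status"]["history"]
    completion = parse_ts(history[0]["completionTime"])
    for pool in pools["items"]:
        if pool["metadata"]["name"] != pool_name:
            continue
        for cond in pool["status"]["conditions"]:
            if cond["type"] == "Updated" and cond["status"] == "True":
                return parse_ts(cond["lastTransitionTime"]) - completion
    return None


# Sample data trimmed to the fields shown in the output above.
pools = {"items": [{"metadata": {"name": "master"}, "status": {"conditions": [
    {"type": "Updated", "status": "True", "lastTransitionTime": "2024-05-17T15:13:22Z"},
    {"type": "Updating", "status": "False", "lastTransitionTime": "2024-05-17T15:13:22Z"},
]}}]}
cluster_versions = {"items": [{"status": {"history": [
    {"completionTime": "2024-05-17T14:15:22Z"},
]}}]}

lag = pool_updated_after_cv(pools, cluster_versions)
print(lag)  # 0:58:00
```

With the timestamps from this run, the master pool settled 58 minutes after the ClusterVersion update was reported complete.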

Because of changes to registry pull secrets:

$ dump() {
> curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]'
> }
$ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
--- /dev/fd/63  2024-05-17 12:28:37.882351026 -0700
+++ /dev/fd/62  2024-05-17 12:28:37.883351026 -0700
@@ -1 +1 @@
-{"key":"172.30.124.169:5000",...
+{"key":"172.30.124.169:5000",...
@@ -3,3 +3,3 @@
-{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
-{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
-{"key":"image-registry.openshift-image-registry.svc:5000",...
+{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
+{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
+{"key":"image-registry.openshift-image-registry.svc:5000",...
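
The decode-and-compare step that `dump` and `diff` perform above can be sketched in Python as follows. This assumes, as the shell pipeline does, that `contents.source` is a percent-encoded `data:,` URL wrapping a JSON document with an `auths` map. The names `decode_pull_secret`, `changed_registries`, and `encode`, and the sample registries and tokens, are hypothetical.

```python
import json
import urllib.parse


def decode_pull_secret(source: str) -> dict:
    """Decode a percent-encoded data URL (data:,<payload>) into its auths mapping."""
    payload = urllib.parse.unquote(source).split(",", 1)[-1]
    return json.loads(payload)["auths"]


def changed_registries(old_source: str, new_source: str) -> list:
    """List registries whose entry differs, without printing any credentials."""
    old = decode_pull_secret(old_source)
    new = decode_pull_secret(new_source)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def encode(auths: dict) -> str:
    """Hypothetical helper that builds a sample data URL for the demo below."""
    return "data:," + urllib.parse.quote(json.dumps({"auths": auths}))


# Made-up sample secrets: one credential rotated, one unchanged.
old = encode({"registry-a:5000": {"auth": "dG9rZW4x"}, "registry-b:5000": {"auth": "c2FtZQ=="}})
new = encode({"registry-a:5000": {"auth": "dG9rZW4y"}, "registry-b:5000": {"auth": "c2FtZQ=="}})

print(changed_registries(old, new))  # ['registry-a:5000']
```

As in the diff above, only the registry names are reported, so the rotated credentials never reach the terminal or logs.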

Version-Release number of selected component (if applicable)

Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.

How reproducible

Sippy reports Success Rate: 94.27% post regression, so a rare race.

But using CI search to pick jobs with 10 or more runs over the past 2 days:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact

shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, at a 27% hit rate.
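
The impact column in the CI search output appears to be the share of all runs that failed with a matching message, which is roughly the failure rate multiplied by the match rate. The following is a minimal sketch of that arithmetic, assuming the line format shown above; `parse_line` and the field names are hypothetical. Because the displayed percentages are already rounded, the product can differ from the reported impact by about a point (for example, 53% × 14% ≈ 7.4% against a reported 8%).

```python
import re

# Matches one summary line of the CI search output shown above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_line(line: str) -> dict:
    """Parse one summary line and estimate impact as failed% * match% / 100."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    failed, match = int(m["failed"]), int(m["match"])
    return {
        "job": m["job"],
        "runs": int(m["runs"]),
        "reported_impact": int(m["impact"]),
        "estimated_impact": failed * match / 100,
    }


# Two lines copied from the output above.
lines = [
    "periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact",
    "pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact",
]

rows = sorted((parse_line(l) for l in lines), key=lambda r: r["reported_impact"], reverse=True)
for row in rows:
    print(f'{row["reported_impact"]:>3}% reported, {row["estimated_impact"]:.1f}% estimated  {row["job"]}')
```

Sorting by impact rather than by job name makes the worst-hit flavors easier to spot.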

Steps to Reproduce

Unclear.

Actual results

Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.

Expected results

No MachineConfigPool roll after the ClusterVersion update completes.

Additional info

blocks

OCPBUGS-36166 New registry pull secrets roll the control plane after 4.16 cluster updates

Closed

is cloned by

TRT-1683 New registry pull secrets roll the control plane after 4.16 cluster updates

Closed

OCPBUGS-36166 New registry pull secrets roll the control plane after 4.16 cluster updates

Closed

relates to

OCPBUGS-33803 machine-os-puller SA refreshes every hour, causing machine config regeneration

Closed

OCPBUGS-33815 openshift-controller-manager overwriting/undoing changes to ServiceAccount imagePullSecrets

Closed

links to

openshift/machine-config-operator#4395: OCPBUGS-33913, OCPBUGS-34261: CurrentImagePullSecret should be consumed by the MCD

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update


Assignee: Zack Zlotnik

Reporter: W. Trevor King

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 13

Created: 2024/05/17 7:44 PM

Updated: 2024/10/01 5:36 PM

Resolved: 2024/10/01 5:36 PM
