
New registry pull secrets roll the control plane after 4.16 cluster updates


      Previously, the `/etc/mco/internal-registry-pull-secret.json` file was managed by the Machine Config Operator (MCO). Due to a recent change, the underlying secret is rotated on an hourly basis, and whenever the MCO detected a change to this secret it rolled the secret out to each node in the cluster, which resulted in disruptions. With this fix, a different internal mechanism processes changes to the internal registry pull secret to avoid rolling out repeated MachineConfig updates.
      (link:https://issues.redhat.com/browse/OCPBUGS-33913[*OCPBUGS-33913*])

      Description of problem

      CI is occasionally bumping into failures like:

      : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 53m22s
      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
      Ginkgo exit error 1: exit with code 1}
      

      where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
      2024-05-17T12:57:04Z RenderDegraded=False : 
      2024-05-17T12:58:35Z Degraded=False : 
      2024-05-17T12:58:35Z NodeDegraded=False : 
      2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
      2024-05-17T15:13:22Z Updating=False : 
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
      2024-05-17T14:15:22Z
      
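      The timestamp comparison above can be automated against locally downloaded gather-extra artifacts. This is a minimal sketch; the function name `pool_rolled_after_update` is mine, and it assumes the same JSON shapes shown in the curl output above (a `machineconfigpools.json` items list and a `clusterversion.json` whose first history entry is the just-completed update):

```python
import json
from datetime import datetime, timezone

def parse_ts(ts):
    # Timestamps in the gathered artifacts are RFC 3339, e.g. "2024-05-17T15:13:22Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def pool_rolled_after_update(pools, cv):
    """Return True when the master pool's last Updated=True transition
    postdates the ClusterVersion completion time, i.e. the symptom this
    bug describes (the pool rolling after the update reports complete)."""
    master = next(p for p in pools["items"] if p["metadata"]["name"] == "master")
    updated = next(c for c in master["status"]["conditions"] if c["type"] == "Updated")
    completion = cv["items"][0]["status"]["history"][0]["completionTime"]
    return parse_ts(updated["lastTransitionTime"]) > parse_ts(completion)
```

      With the values from this run (Updated at 15:13:22Z vs. completion at 14:15:22Z) the check returns True.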

      Because of changes to registry pull secrets:

      $ dump() {
      >   curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json |
      >     jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" |
      >     python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' |
      >     jq -c '.auths | to_entries[]'
      > }
      $ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
      --- /dev/fd/63  2024-05-17 12:28:37.882351026 -0700
      +++ /dev/fd/62  2024-05-17 12:28:37.883351026 -0700
      @@ -1 +1 @@
      -{"key":"172.30.124.169:5000",...
      +{"key":"172.30.124.169:5000",...
      @@ -3,3 +3,3 @@
      -{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
      -{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
      -{"key":"image-registry.openshift-image-registry.svc:5000",...
      +{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
      +{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
      +{"key":"image-registry.openshift-image-registry.svc:5000",...
      
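      The shell `dump()` above can equivalently be done in pure Python against a downloaded `machineconfigs.json`. A sketch, assuming the Ignition file source is a `data:` URL as in the rendered configs above (the helper name `pull_secret_auth_keys` is mine):

```python
import json
import urllib.parse

def pull_secret_auth_keys(machineconfig):
    """Extract the registry hostnames from the internal registry pull
    secret embedded in a rendered MachineConfig. Mirrors the shell
    dump() pipeline: unquote the data: URL, drop the media-type prefix,
    parse the JSON, and return the .auths keys."""
    for f in machineconfig["spec"]["config"]["storage"]["files"]:
        if f["path"] == "/etc/mco/internal-registry-pull-secret.json":
            source = f["contents"]["source"]  # e.g. "data:,%7B%22auths%22..."
            payload = urllib.parse.unquote(source).split(",", 1)[-1]
            return sorted(json.loads(payload)["auths"])
    return []
```

      Comparing the sorted key lists from two rendered configs shows, as in the diff above, that the registry hostnames are identical between renders; only the credential values change, which is what makes the resulting MachineConfig roll surprising.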

      Version-Release number of selected component (if applicable)

      Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.

      How reproducible

      Sippy reports a Success Rate of 94.27% since the regression, so this is a rare race.

      But using CI search to pick jobs with 10 or more runs over the past 2 days:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
      periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
      periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
      periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
      periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
      periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
      pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
      pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
      pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
      pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
      pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
      pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
      

      shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, up at a 27% hit rate.
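      For reference, the "impact" figure in each CI-search row is the fraction of all runs that both failed and matched the search, i.e. failed% × match%. A small sketch (the function name `impact` is mine; CI search rounds from the underlying run counts, so a row can differ by a point from this recomputation):

```python
import re

def impact(line):
    """Recompute a CI-search row's impact percentage from its
    'N runs, F% failed, M% of failures match' fields."""
    m = re.search(r"(\d+) runs, (\d+)% failed, (\d+)% of failures match", line)
    runs, failed, match = map(int, m.groups())
    # failed% of runs fail; match% of those match; express as % of all runs.
    return round(failed * match / 100)
```

      For the azure row above: 50% failed × 12% match = 6% impact, and for the cluster-network-operator row: 36% × 75% = 27%.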

      Steps to Reproduce

      Unclear.

      Actual results

      Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.

      Expected results

      No MachineConfigPool roll after the ClusterVersion update completes.

      Additional info

            Zack Zlotnik (zzlotnik@redhat.com)
            W. Trevor King (trking)
            Sergio Regidor de la Rosa