
OCPBUGS-36166: New registry pull secrets roll the control plane after 4.16 cluster updates


    • Severity: Moderate
    • Sprint: MCO Sprint 255, MCO Sprint 256
    • Story Points: 2
    • Release Note Text: The machine-config operator now uses an internal mechanism to avoid rolling out repeated MachineConfig updates in response to changes to the internal registry pull secret.
    • Release Note Type: Release Note Not Required
    • Status: In Progress

      This is a clone of issue OCPBUGS-33913. The following is the description of the original issue:

      Description of problem

      CI is occasionally bumping into failures like:

      : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] (53m22s)
      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
      Ginkgo exit error 1: exit with code 1}
      

      where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
      2024-05-17T12:57:04Z RenderDegraded=False : 
      2024-05-17T12:58:35Z Degraded=False : 
      2024-05-17T12:58:35Z NodeDegraded=False : 
      2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
      2024-05-17T15:13:22Z Updating=False : 
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
      2024-05-17T14:15:22Z
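
      Since both timestamps are RFC 3339 in UTC, a plain string comparison is enough to confirm the ordering; a minimal sketch against the same artifacts (variable names here are illustrative):

      $ mcp_updated=$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json |
      >   jq -r '.items[] | select(.metadata.name == "master").status.conditions[] | select(.type == "Updated").lastTransitionTime')
      $ cvo_done=$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json |
      >   jq -r '.items[].status.history[0].completionTime')
      $ [[ "$mcp_updated" > "$cvo_done" ]] && echo "master pool settled at $mcp_updated, after the CVO completed at $cvo_done"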
      

      Because of changes to registry pull secrets:

      $ dump() {
      >   curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json |
      >     jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" |
      >     python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' |
      >     jq -c '.auths | to_entries[]'
      > }
      $ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
      --- /dev/fd/63  2024-05-17 12:28:37.882351026 -0700
      +++ /dev/fd/62  2024-05-17 12:28:37.883351026 -0700
      @@ -1 +1 @@
      -{"key":"172.30.124.169:5000",...
      +{"key":"172.30.124.169:5000",...
      @@ -3,3 +3,3 @@
      -{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
      -{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
      -{"key":"image-registry.openshift-image-registry.svc:5000",...
      +{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
      +{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
      +{"key":"image-registry.openshift-image-registry.svc:5000",...
      

      Version-Release number of selected component (if applicable)

      Seen in 4.16-to-4.16 Azure update CI; the wider scope is unclear.

      How reproducible

      Sippy reports a 94.27% success rate post-regression, so this is a rare race.

      But using CI search to pick jobs with 10 or more runs over the past 2 days:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
      periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
      periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
      periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
      periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
      periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
      pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
      pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
      pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
      pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
      pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
      pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
      pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact
      

      shows some flavors, like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade, up at a 27% hit rate.
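
      Ranking the same output numerically by the impact percentage after the '=' makes the worst offenders obvious; a minimal sketch:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' |
      >   grep '[0-9][0-9] runs.*failures match' | sort -t= -k2 -rn | head -n3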

      Steps to Reproduce

      Unclear.

      Actual results

      Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.

      Expected results

      No MachineConfigPool roll after the ClusterVersion update completes.

      Additional info

            Assignee: Zack Zlotnik (zzlotnik@redhat.com)
            Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
            QA Contact: Sergio Regidor de la Rosa