OCPBUGS-18596: Metal CI clusters see disproportionate image registry disruption during upgrade


    • Sprint: Sprint 243, Sprint 246
    • Release Note Type: Release Note Not Required

      TRT has noticed that metal 4.14 clusters seem to be seeing an unacceptable amount of disruption to the image registry during upgrades.

      Graphs: https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=metal&var-percentile=P50&var-backend=image-registry-new-connections&var-backend=image-registry-reused-connections&var-backend=ingress-to-console-new-connections&var-backend=ingress-to-console-reused-connections&var-releases=4.14&var-upgrade_type=minor&var-upgrade_type=micro&var-networks=sdn&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=7&var-master_node_updated=Y

      This shows a P50 of 12-45s over a week's worth of job runs. Compare to ingress-to-console, which sees at most 2s at the P50.
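
      For context, these numbers come from disruption monitors that poll each backend roughly once per second for the duration of the upgrade and record the intervals where requests fail; the new- vs reused-connections variants differ only in whether every request opens a fresh TCP connection. A minimal sketch of that idea is below (the real monitors live in openshift/origin; the URL, timeout, and success check here are placeholders, not the actual implementation):

      package main

      import (
          "fmt"
          "net/http"
          "time"
      )

      // pollBackend hits url once per second for the given duration and prints each
      // interval during which requests failed. With newConns=true every request opens
      // a fresh TCP connection (the "new-connections" backend); otherwise connections
      // are reused (the "reused-connections" backend).
      func pollBackend(url string, d time.Duration, newConns bool) {
          client := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{DisableKeepAlives: newConns},
          }

          var outageStart time.Time
          deadline := time.Now().Add(d)
          for time.Now().Before(deadline) {
              resp, err := client.Get(url)
              healthy := err == nil && resp.StatusCode < 500
              if resp != nil {
                  resp.Body.Close()
              }

              switch {
              case !healthy && outageStart.IsZero():
                  outageStart = time.Now() // outage begins
              case healthy && !outageStart.IsZero():
                  fmt.Printf("disruption: %s for %s\n",
                      outageStart.Format(time.RFC3339), time.Since(outageStart).Round(time.Second))
                  outageStart = time.Time{} // outage ends
              }
              time.Sleep(time.Second)
          }
      }

      func main() {
          // Placeholder route; the CI monitors target the registry's health endpoint.
          pollBackend("https://image-registry.example.com/healthz", 30*time.Minute, true)
      }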

      The problem appears to affect both micro and minor upgrades, and both sdn and ovn networking. As such, this appears to be a registry/metal problem, not a networking issue.

      Sample job runs:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1699317456872411136

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-metal-ipi-sdn-bm-upgrade/1699192726903328768

      If you expand the first "intervals" spyglass chart on each of these jobs and search for "registry", you will see that the disruption overlaps with the image-registry ClusterOperator reporting Available=False with a message of:

      condition/Available status/False reason/NoReplicasAvailable changed: Available: The deployment does not have available replicas
      NodeCADaemonAvailable: The daemon set node-ca has available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
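
      For anyone watching this live outside CI, the same transition shows up with something like "oc get clusteroperator image-registry -w", or by watching the registry deployment's available replicas with "oc -n openshift-image-registry get deployment image-registry -w". A minimal sketch of the same check using the OpenShift config client follows (kubeconfig handling and error handling are abbreviated, and the 5s poll interval is arbitrary):

      package main

      import (
          "context"
          "fmt"
          "time"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig location.
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := configclient.NewForConfigOrDie(cfg)

          // Poll the image-registry ClusterOperator and log every transition of its
          // Available condition, roughly what the spyglass intervals chart renders.
          var last configv1.ConditionStatus
          for {
              co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "image-registry", metav1.GetOptions{})
              if err == nil {
                  for _, c := range co.Status.Conditions {
                      if c.Type == configv1.OperatorAvailable && c.Status != last {
                          fmt.Printf("%s Available=%s reason=%s message=%q\n",
                              time.Now().Format(time.RFC3339), c.Status, c.Reason, c.Message)
                          last = c.Status
                      }
                  }
              }
              time.Sleep(5 * time.Second)
          }
      }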
      

      Is there some reason metal specifically would encounter a problem here?

      Filing against the registry component as I don't know how to reach the metal folks otherwise, but I will loop them in on Slack.

            Assignee: Flavian Missi (fmissi)
            Reporter: Devan Goodwin (rhn-engineering-dgoodwin)
            QA Contact: Wen Wang
            Votes: 0
            Watchers: 6
