OpenShift Bugs / OCPBUGS-57529

Metal disruption during master upgrade

      TRT has found a regression in in-cluster disruption that shows a fairly well-defined pattern, one we do not see in 4.19.

      During node upgrades, when one of the masters is upgrading (seemingly the second master to update), we see a brief disruption to almost all API endpoints, but interestingly also to a number of -to-pod internal networking backends, all targeting the master that is upgrading. Notably, no -to-host or -to-service backends fail at this time, which suggests that only the pod network is affected.

      For reference, the in-cluster disruption monitoring shuts itself off for a host when it sees that host's IPs disappear from the endpoint slice, indicating the host is being rebooted/upgraded (IIRC).
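
      For illustration only, here is a minimal Go sketch of that kind of check, not the actual monitor code; the namespace, label selector, and node name are hypothetical, and it simply polls EndpointSlices and treats the disappearance of a node's endpoints as the signal to pause disruption sampling against that node:

      // Illustrative sketch only (assumed names throughout): poll EndpointSlices
      // for a backend service and treat the disappearance of a node's endpoints
      // as the signal that the node is rebooting/upgrading.
      package main

      import (
      	"context"
      	"fmt"
      	"time"

      	discoveryv1 "k8s.io/api/discovery/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/client-go/kubernetes"
      	"k8s.io/client-go/rest"
      )

      // nodeHasReadyEndpoints reports whether any ready endpoint in the slices is
      // still published for the given node.
      func nodeHasReadyEndpoints(slices *discoveryv1.EndpointSliceList, nodeName string) bool {
      	for _, slice := range slices.Items {
      		for _, ep := range slice.Endpoints {
      			if ep.NodeName == nil || *ep.NodeName != nodeName {
      				continue
      			}
      			if ep.Conditions.Ready == nil || *ep.Conditions.Ready {
      				return true
      			}
      		}
      	}
      	return false
      }

      func main() {
      	cfg, err := rest.InClusterConfig()
      	if err != nil {
      		panic(err)
      	}
      	client := kubernetes.NewForConfigOrDie(cfg)

      	// Hypothetical namespace, service label, and node name for illustration.
      	const namespace = "e2e-pod-network-disruption"
      	const serviceLabel = "kubernetes.io/service-name=pod-to-pod-backend"
      	const nodeName = "master-1" // the master currently being upgraded

      	for {
      		slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(
      			context.TODO(), metav1.ListOptions{LabelSelector: serviceLabel})
      		if err != nil {
      			fmt.Println("failed to list EndpointSlices:", err)
      		} else if nodeHasReadyEndpoints(slices, nodeName) {
      			fmt.Println("node endpoints present: disruption to this backend counts")
      		} else {
      			fmt.Println("node endpoints gone: assume reboot/upgrade, pause sampling")
      		}
      		time.Sleep(5 * time.Second)
      	}
      }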

      Examples:

      https://sippy.dptools.openshift.org/sippy-ng/job_runs/1931506638221479936/periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/intervals?end=2025-06-08T03%3A15%3A53Z&filterText=&intervalFile=e2e-timelines_spyglass_20250608-020648.json&overrideDisplayFlag=0&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=KubeEvent&selectedSources=NodeState&start=2025-06-08T00%3A40%3A00Z

      https://sippy.dptools.openshift.org/sippy-ng/job_runs/1933886484792741888/periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/intervals?end=2025-06-14T16%3A57%3A18Z&filterText=&intervalFile=e2e-timelines_spyglass_20250614-154711.json&overrideDisplayFlag=0&selectedSources=OperatorDegraded&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=NodeState&start=2025-06-14T16%3A3630Z

      https://sippy.dptools.openshift.org/sippy-ng/job_runs/1934296704639569920/periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/intervals?end=2025-06-15T20%3A09%3A42Z&filterText=&intervalFile=e2e-timelines_spyglass_20250615-185807.json&overrideDisplayFlag=0&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState&start=2025-06-15T17%3A30%3A53Z

      For more runs see the job runs panel on this dashboard.

      Most hits come from periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-upgrade-ovn-ipv6, but note that one run from periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-ovn-upgrade shows the same pattern. No 4.19 jobs show up, despite being included in the report.

      The problem potentially started around June 5th-6th, but this job does not run often, so it may have started earlier.

              Assignee: Weibin Liang (weliang1@redhat.com)
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)