Bug
Resolution: Done
Critical
4.20.0
Quality / Stability / Reliability
Proposed
TRT has found a regression in in-cluster disruption that shows a fairly well-defined pattern, one we do not see in 4.19.
During node upgrades, while one of the masters is upgrading (seemingly the second master to update), we get a brief disruption to almost all API endpoints, but interestingly also to a number of -to-pod internal networking backends, all targeting the master that is upgrading. We do not see any -to-host or -to-service backends failing at this time, suggesting that only the pod network is affected.
For reference, the in-cluster networking monitoring shuts itself off when it sees a host's IPs disappear from the EndpointSlice, indicating that host is being rebooted/upgraded (iirc).
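A minimal sketch of that shutdown behavior, for readers unfamiliar with it: a poller lists the EndpointSlices backing a target service and stops sampling any backend whose IP has disappeared, treating that as "the host is rebooting/upgrading". The namespace, service label value, and pollBackend helper below are illustrative assumptions, not the actual monitor implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "e2e-disruption"                     // assumed namespace
	active := map[string]context.CancelFunc{}  // endpoint IP -> cancel func for its poller

	for {
		slices, err := client.DiscoveryV1().EndpointSlices(ns).List(context.TODO(), metav1.ListOptions{
			LabelSelector: "kubernetes.io/service-name=disruption-target", // assumed service name
		})
		if err != nil {
			fmt.Println("list failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}

		// Collect the IPs currently present across the EndpointSlices.
		current := map[string]bool{}
		for _, s := range slices.Items {
			for _, ep := range s.Endpoints {
				for _, ip := range ep.Addresses {
					current[ip] = true
				}
			}
		}

		// Start pollers for newly appearing IPs; stop pollers whose IP vanished
		// (the "host is being rebooted/upgraded" signal described above).
		for ip := range current {
			if _, ok := active[ip]; !ok {
				ctx, cancel := context.WithCancel(context.Background())
				active[ip] = cancel
				go pollBackend(ctx, ip)
			}
		}
		for ip, cancel := range active {
			if !current[ip] {
				fmt.Printf("%s left the EndpointSlice; stopping its poller\n", ip)
				cancel()
				delete(active, ip)
			}
		}
		time.Sleep(2 * time.Second)
	}
}

// pollBackend stands in for the real per-backend disruption sampler.
func pollBackend(ctx context.Context, ip string) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println("sampling", ip) // real code would issue a request and record availability
		}
	}
}
```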
Examples:
For more runs, see the job runs panel on this dashboard.
Most hits come from periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-upgrade-ovn-ipv6, but one run from periodic-ci-openshift-release-master-nightly-4.20-upgrade-from-stable-4.19-e2e-metal-ipi-ovn-upgrade shows up with the same pattern. Note that no 4.19 jobs appear despite being included in the report.
The problem potentially started around June 5th-6th, but this job does not run often, so it may have started earlier.