Bug
Resolution: Won't Do
Normal
None
4.15.0
No
False
TRT has discovered that some job runs of periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-serial-virtualmedia-bond appear to show catastrophic network problems within the cluster. The problem is relatively rare, but appears several times a month in both 4.14 and 4.15.
This disruption monitoring is done via pods running within the cluster as DaemonSets on every node, testing connectivity to every other node across several configurations (pod to pod, pod to service, host to pod, etc.).
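For readers unfamiliar with the framework, a minimal sketch (in Go) of the kind of probe those DaemonSet pods run is below. This is an illustration only, not the actual origin disruption-checking code; the target addresses, endpoint path, and timeout are assumptions.

{code:go}
// connectivity_probe.go: minimal sketch of an in-cluster connectivity probe.
// Illustrative only; the real framework runs on every node as a DaemonSet
// and records results as intervals, which is not reproduced here.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe makes a single GET request to a peer endpoint (pod IP, service IP,
// or host-network address, depending on the configuration under test) and
// reports whether it succeeded within the timeout.
func probe(target string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(target)
	if err != nil {
		return err // e.g. "i/o timeout" or "no route to host"
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, target)
	}
	return nil
}

func main() {
	// Hypothetical peer endpoints; the real checker discovers every node's
	// pods and services and tests each combination continuously.
	targets := []string{
		"http://10.128.0.10:8080/healthz", // pod to pod
		"http://172.30.0.15:80/healthz",   // pod to service
	}
	for _, t := range targets {
		if err := probe(t, 5*time.Second); err != nil {
			fmt.Printf("DISRUPTION %s: %v\n", t, err)
		} else {
			fmt.Printf("OK %s\n", t)
		}
	}
}
{code}

The real framework turns these per-request results into intervals, which is what the Spyglass chart mentioned below renders.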
Job runs experiencing this problem can be found here (scroll down to the most recent job runs list).
An example run would be: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-serial-virtualmedia-bond/1729365048024895488
Logs might be extremely useful for this problem.
Expand the Spyglass intervals chart near the top of the Prow job page and scroll down to the disruption section; you will see massive bars of red, which I believe indicate this cluster experienced in-cluster disruption as soon as we started checking for it after install.
This in-cluster framework is relatively new and there could be gotchas here, but most runs of this job do NOT experience this problem, so it appears to work at least most of the time. However, only a few tests fail, which is unusual for what appears to be catastrophic loss of the in-cluster network, so we'll have to be careful this isn't a red herring.
Hovering over the bars of red shows the messages returned by the request attempts; they vary between "i/o timeout" and "no route to host".
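Those two messages typically point at different failure modes: an i/o timeout usually means the probe got no answer at all (packets dropped or never forwarded), while "no route to host" is an explicit rejection (EHOSTUNREACH, often from an ICMP unreachable or a missing route). A small, hypothetical sketch of bucketing the messages when summarizing runs; the categories and matching strings are assumptions, not part of the framework:

{code:go}
package main

import (
	"fmt"
	"strings"
)

// classify buckets a disruption message into a rough failure mode so the
// red bars can be summarized. The buckets are based only on the messages
// seen when hovering over the chart.
func classify(msg string) string {
	switch {
	case strings.Contains(msg, "i/o timeout"):
		return "timeout (no response; packets likely dropped)"
	case strings.Contains(msg, "no route to host"):
		return "no route to host (explicit rejection)"
	default:
		return "other"
	}
}

func main() {
	for _, msg := range []string{
		"dial tcp 10.128.0.10:8080: i/o timeout",
		"dial tcp 172.30.0.15:80: connect: no route to host",
	} {
		fmt.Printf("%-55s -> %s\n", msg, classify(msg))
	}
}
{code}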
Nov 15 may be the first time this appeared. A rough guess is that this affects maybe 10% of runs since then. The job may have only existed since Nov 5.
- is related to: TRT-1350 Investigate metal-ipi-sdn-serial-virtualmedia-bond in-cluster disruption (Closed)