OpenShift Bugs / OCPBUGS-24445

Some e2e-metal-ipi-sdn-serial-virtualmedia-bond job runs experience extreme network disruption in-cluster


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.15.0

    Description

      TRT has discovered that some job runs of periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-serial-virtualmedia-bond appear to show catastrophic network problems within the cluster. The problem is relatively rare, but appears several times a month in both 4.14 and 4.15.

      This disruption monitoring is done by pods running within the cluster as daemonsets on every node, each testing connectivity to every other node across several configurations (pod to pod, pod to service, host to pod, etc.).
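
      Assuming those checks are simple HTTP requests (an assumption; the exact probe mechanism is not shown here), a minimal sketch of such a poller looks roughly like the following. The target service name, port, interval, and timeout are all illustrative and not taken from the actual openshift/origin monitoring code.

```go
// Minimal sketch of an in-cluster connectivity poller, assuming HTTP probes.
// All names and values here are hypothetical, not the real framework's code.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical target; the real framework probes every other node's pods.
	target := "http://disruption-target.example.svc:8080/healthz"
	client := &http.Client{Timeout: 5 * time.Second}

	for {
		start := time.Now()
		resp, err := client.Get(target)
		if err != nil {
			// Errors such as "i/o timeout" or "no route to host" surface here
			// and would be recorded as a disruption interval.
			fmt.Printf("%s disrupted: %v\n", start.Format(time.RFC3339), err)
		} else {
			resp.Body.Close()
			fmt.Printf("%s ok in %s\n", start.Format(time.RFC3339), time.Since(start))
		}
		time.Sleep(1 * time.Second)
	}
}
```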

      Job runs experiencing this problem can be found here (scroll down to the most recent job runs list).

      An example run would be: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-serial-virtualmedia-bond/1729365048024895488

      Logs might be extremely useful for this problem.

      Expand the Spyglass intervals chart near the top of the prow job page and scroll down to the disruption section; you will see massive bars of red, which I believe indicate that this cluster experienced in-cluster disruption as soon as we started checking for it after install.

      This in-cluster framework is relatively new and there could be gotchas here, but most runs of this job do NOT experience this problem, so the framework appears to work at least most of the time. However, only a few tests fail, which is unusual for what appears to be a catastrophic loss of in-cluster networking, so we'll have to be careful this isn't a red herring.

      Hovering over the bars of red shows the messages returned by the request attempts; they vary between "i/o timeout" and "no route to host".
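
      Those two error strings usually point at different failure modes: an "i/o timeout" means no reply arrived before the client's deadline (commonly dropped packets or an unresponsive endpoint), while "no route to host" means the kernel, or an ICMP reply, is actively reporting the destination as unreachable. A hedged sketch of telling the two apart from a Go dial error (the target address below is a documentation address, not one from this job):

```go
// Illustrative only: classify a dial error into the two modes seen in the chart.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func classify(err error) string {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "i/o timeout: no reply before the deadline"
	}
	if errors.Is(err, syscall.EHOSTUNREACH) {
		return "no route to host: destination reported unreachable"
	}
	return fmt.Sprintf("other error: %v", err)
}

func main() {
	// 192.0.2.1 is a TEST-NET-1 documentation address, used purely as an example.
	_, err := net.DialTimeout("tcp", "192.0.2.1:9000", 2*time.Second)
	if err != nil {
		fmt.Println(classify(err))
	}
}
```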

      Nov 15 may be the first time this appears. A rough guess is that this affects maybe 10% of runs since then. The job itself may have only existed since Nov 5.


People

    Assignee: Jamo Luhrsen (jluhrsen)
    Reporter: Devan Goodwin (rhn-engineering-dgoodwin)
    QA Contact: Arti Sood
    Votes: 0
    Watchers: 6
