OpenShift Bugs / OCPBUGS-24445

Some e2e-metal-ipi-sdn-serial-virtualmedia-bond job runs experience extreme network disruption in-cluster


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.15.0

    Description

      TRT has discovered that some job runs of periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-serial-virtualmedia-bond appear to show catastrophic network problems within the cluster. The problem is relatively rare, but appears several times a month in both 4.14 and 4.15.

      This disruption monitoring is done by pods running within the cluster as daemonsets on every node, each testing connectivity to every other node across several configurations (pod to pod, pod to service, host to pod, etc.).
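
      Assuming those checks are simple HTTP requests (an assumption; the exact probe mechanism is not shown here), a minimal sketch of such a poller looks roughly like the following. The target service name, port, interval, and timeout are all illustrative and not taken from the actual openshift/origin monitoring code.

```go
// Minimal sketch of an in-cluster connectivity poller, assuming HTTP probes.
// All names and values here are hypothetical, not the real framework's code.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical target; the real framework probes every other node's pods.
	target := "http://disruption-target.example.svc:8080/healthz"
	client := &http.Client{Timeout: 5 * time.Second}

	for {
		start := time.Now()
		resp, err := client.Get(target)
		if err != nil {
			// Errors such as "i/o timeout" or "no route to host" surface here
			// and would be recorded as a disruption interval.
			fmt.Printf("%s disrupted: %v\n", start.Format(time.RFC3339), err)
		} else {
			resp.Body.Close()
			fmt.Printf("%s ok in %s\n", start.Format(time.RFC3339), time.Since(start))
		}
		time.Sleep(1 * time.Second)
	}
}
```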

      Job runs experiencing this problem can be found here (scroll down to the most recent job runs list).

      An example run would be: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-serial-virtualmedia-bond/1729365048024895488

      Logs might be extremely useful for this problem.

      Expand the Spyglass intervals chart near the top of the prow job page and scroll down to the disruption section; you will see massive bars of red, which I believe indicate that this cluster experienced in-cluster disruption as soon as we started checking for it after install.

      This in-cluster framework is relatively new and there could be gotchas here, but most runs of this job do NOT experience this problem, so the framework appears to work at least most of the time. However, only a few tests fail, which is unusual for what appears to be a catastrophic loss of in-cluster networking, so we'll have to be careful this isn't a red herring.

      Hovering over the bars of red shows the messages returned by the request attempts; they vary between "i/o timeout" and "no route to host".
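
      Those two error strings usually point at different failure modes: an "i/o timeout" means no reply arrived before the client's deadline (commonly dropped packets or an unresponsive endpoint), while "no route to host" means the kernel, or an ICMP reply, is actively reporting the destination as unreachable. A hedged sketch of telling the two apart from a Go dial error (the target address below is a documentation address, not one from this job):

```go
// Illustrative only: classify a dial error into the two modes seen in the chart.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func classify(err error) string {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "i/o timeout: no reply before the deadline"
	}
	if errors.Is(err, syscall.EHOSTUNREACH) {
		return "no route to host: destination reported unreachable"
	}
	return fmt.Sprintf("other error: %v", err)
}

func main() {
	// 192.0.2.1 is a TEST-NET-1 documentation address, used purely as an example.
	_, err := net.DialTimeout("tcp", "192.0.2.1:9000", 2*time.Second)
	if err != nil {
		fmt.Println(classify(err))
	}
}
```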

      Nov 15 may be the first time this appears. A rough guess is that this affects maybe 10% of runs since then. The job itself may have only existed since Nov 5.


People

    Assignee: Jamo Luhrsen (jluhrsen)
    Reporter: Devan Goodwin (rhn-engineering-dgoodwin)
    QA Contact: Arti Sood
    Votes: 0
    Watchers: 6
