OCP Technical Release Team / TRT-849

Pinpoint possible CI cluster networking issues


Details

    • Type: Story
    • Resolution: Obsolete
    • Priority: Normal

    Description

      kenzhang@redhat.com identified a situation where we saw to-the-second simultaneous disruption against multiple backends: some reported a DNS lookup error (which we believe is a CI cluster problem), while others reported a more normal TCP I/O error, which we thought was real disruption to the cluster. The fact that both of these could be occurring independently at the same time is highly suspect, and we're wondering if there's a larger DNS issue, or a larger networking issue, at play.

      For an example, see the second spyglass chart at 8:18:53: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27694-ci-4.13-e2e-azure-ovn-upgrade/1623906725591519232

      Data must be gathered to prove this is occurring, how often, on which NURPs, on which build clusters, etc. We also need this data to detect whether the problem is improving or fixed.

      This test detects DNS problems in the CI cluster, and it continues to fail about 25% of the time: https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bsig-trt%5D%20no%20DNS%20lookup%20errors%20should%20be%20encountered%20in%20disruption%20samplers
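
      As context for how a sampler can tell the two failure modes apart, here is a minimal Go sketch that classifies a request error as a DNS lookup error versus a more ordinary TCP I/O error; the helper name and backend URL are hypothetical illustrations, not the actual openshift-tests code.

      {code:go}
      package main

      import (
          "errors"
          "fmt"
          "net"
          "net/http"
          "time"
      )

      // classifyError is a hypothetical helper: it reports whether a sampler error
      // looks like a DNS lookup failure (suspected CI cluster problem) or a more
      // ordinary TCP/connection error (suspected real disruption to the cluster).
      func classifyError(err error) string {
          var dnsErr *net.DNSError
          if errors.As(err, &dnsErr) {
              return "dns-lookup-error"
          }
          var opErr *net.OpError
          if errors.As(err, &opErr) {
              return "tcp-io-error"
          }
          return "other-error"
      }

      func main() {
          client := &http.Client{Timeout: 5 * time.Second}
          // Hypothetical backend URL; the real samplers poll their own endpoints.
          resp, err := client.Get("https://kube-api.example.test/healthz")
          if err != nil {
              fmt.Println("sample failed:", classifyError(err))
              return
          }
          resp.Body.Close()
          fmt.Println("sample succeeded:", resp.Status)
      }
      {code}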

      Options proposed on the Feb 13 scrum call:

      Option 1: Separate DNS from normal TCP disruption testing

      Have openshift-tests do an initial lookup of the hostnames and write them all to /etc/hosts, so normal disruption testing never does DNS lookups for these again (see the sketch after this option).

      Add actual DNS query disruption testing as backends. Hit multiple DNS servers, possibly with multiple backends for each, just to load up requests; we don't want to drop from 8 to 1, which might mask the problem.

      This would separate the two paths and potentially expose the problem; if we still see both occur simultaneously, we have identified a general networking problem in the CI cluster.
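
      A minimal sketch of the Option 1 pre-resolution step, assuming a list of backend hostnames is known up front; the hostnames, helper name, and direct append to /etc/hosts are assumptions for illustration, not the final design.

      {code:go}
      package main

      import (
          "fmt"
          "net"
          "os"
          "strings"
      )

      // pinHosts resolves each hostname once and returns /etc/hosts-style lines,
      // so later disruption sampling never performs another DNS lookup for them.
      func pinHosts(hostnames []string) (string, error) {
          var b strings.Builder
          for _, h := range hostnames {
              addrs, err := net.LookupHost(h)
              if err != nil || len(addrs) == 0 {
                  return "", fmt.Errorf("initial lookup of %s failed: %v", h, err)
              }
              b.WriteString(fmt.Sprintf("%s %s\n", addrs[0], h))
          }
          return b.String(), nil
      }

      func main() {
          // Hypothetical backend hostnames.
          entries, err := pinHosts([]string{"api.example-cluster.test", "console.example-cluster.test"})
          if err != nil {
              panic(err)
          }
          // Append the pinned entries to /etc/hosts (needs write access in the sampler's environment).
          f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
          if err != nil {
              panic(err)
          }
          defer f.Close()
          if _, err := f.WriteString(entries); err != nil {
              panic(err)
          }
      }
      {code}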

      Option 2: Write unit test to correlate DNS disruption with real cluster disruption

      If we see to-the-second overlap (possibly +/- 1 second) of real vs. DNS disruption, fail the test and report when the overlap hit (see the sketch after this option).

      This would then give us accurate reporting of how often the problem is occurring. It does not provide evidence one way or the other about which is the real problem, DNS or full networking.
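
      A minimal sketch of the Option 2 correlation check, assuming disruption events are available as intervals tagged with a backend name and an error class; the types, tolerance, and sample data are assumptions for illustration.

      {code:go}
      package main

      import (
          "fmt"
          "time"
      )

      // interval is a hypothetical disruption record: when it started and ended,
      // and whether the sampler classified it as a DNS failure or a real TCP failure.
      type interval struct {
          backend    string
          isDNSError bool
          start, end time.Time
      }

      // overlaps reports whether two intervals overlap within a +/- tolerance.
      func overlaps(a, b interval, tol time.Duration) bool {
          return a.start.Before(b.end.Add(tol)) && b.start.Before(a.end.Add(tol))
      }

      // findSuspectOverlaps returns pairs where a DNS-error disruption overlaps a
      // TCP-error disruption to the second; a test could fail whenever any are found.
      func findSuspectOverlaps(events []interval, tol time.Duration) []string {
          var hits []string
          for i, a := range events {
              for _, b := range events[i+1:] {
                  if a.isDNSError != b.isDNSError && overlaps(a, b, tol) {
                      hits = append(hits, fmt.Sprintf("%s overlaps %s around %s",
                          a.backend, b.backend, a.start.Format(time.RFC3339)))
                  }
              }
          }
          return hits
      }

      func main() {
          // Hypothetical sample data around the 8:18:53 example above.
          t0 := time.Date(2023, 2, 9, 8, 18, 53, 0, time.UTC)
          events := []interval{
              {"kube-api", true, t0, t0.Add(2 * time.Second)},
              {"oauth", false, t0.Add(time.Second), t0.Add(3 * time.Second)},
          }
          for _, hit := range findSuspectOverlaps(events, time.Second) {
              fmt.Println(hit)
          }
      }
      {code}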

      Option 3: Duplicate all backends with a copy that tests against direct host IPs

      The team mentioned that SNI means this should work. kube-api-new-ip-connections would then go straight to the IP we looked up at the start of the disruption monitoring process (see the sketch after this option).

      This should expose in spyglass whether the problem is surfacing in the DNS path, the direct-to-IP path, or both at any given point in time.

      This may also require something like Option 2 to give us a signal on how often the problem occurs and whether we're getting better.
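
      A minimal sketch of the Option 3 direct-to-IP path, assuming the backend's TLS certificate is valid for the original hostname; the hostname, IP, and port are hypothetical. The request goes to the pre-resolved IP while SNI (and the Host header) still carry the hostname, so no DNS lookup happens during sampling.

      {code:go}
      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "time"
      )

      func main() {
          const hostname = "kube-api.example-cluster.test" // hypothetical backend hostname
          const resolvedIP = "203.0.113.10"                // looked up once when monitoring starts

          // Keep SNI pointed at the hostname so TLS verification still works
          // even though the connection is dialed straight to the IP.
          client := &http.Client{
              Timeout: 5 * time.Second,
              Transport: &http.Transport{
                  TLSClientConfig: &tls.Config{ServerName: hostname},
              },
          }

          req, err := http.NewRequest("GET", fmt.Sprintf("https://%s:6443/healthz", resolvedIP), nil)
          if err != nil {
              panic(err)
          }
          req.Host = hostname

          resp, err := client.Do(req)
          if err != nil {
              fmt.Println("direct-IP path error:", err)
              return
          }
          defer resp.Body.Close()
          fmt.Println("direct-IP path status:", resp.Status)
      }
      {code}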

      Option 4: Spin up VM in cloud provider to do disruption testing

      This takes CoreDNS in the CI cluster out of the picture; we could then compare its results with those we got from the CI cluster and determine whether we're catching real disruption or not.

    People

      Assignee: Unassigned
      Reporter: Devan Goodwin (rhn-engineering-dgoodwin)