-
Story
-
Resolution: Obsolete
-
Normal
-
kenzhang@redhat.com identified a situation where we saw to-the-second simultaneous disruption against multiple backends: some reported a DNS lookup error (which we believe is a CI cluster problem), and some reported a more normal TCP I/O error, which we thought was real disruption to the cluster. The fact that both of these could occur independently at the same time is highly suspect, and we're wondering if there's a larger DNS issue, or a larger networking issue, at play.
For an example, see the second spyglass chart at 8:18:53: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27694-ci-4.13-e2e-azure-ovn-upgrade/1623906725591519232
Data must be gathered to prove this is occurring, how often, on which NURPs, on which build clusters, etc. We also need this data to detect whether the problem is improving or fixed.
This test detects DNS problems in the CI cluster and continues to fail about 25% of the time: https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bsig-trt%5D%20no%20DNS%20lookup%20errors%20should%20be%20encountered%20in%20disruption%20samplers
Options proposed on the Feb 13 scrum call:
Option 1: Separate DNS from normal TCP disruption testing
Have openshift-tests do an initial lookup of all backend hostnames and write them to /etc/hosts; normal disruption testing would then never do DNS lookups for these again (see the sketch after this option).
Add actual DNS query disruption testing as backends, hitting multiple DNS servers. Possibly use multiple backends for each just to load up requests, as we don't want to drop from 8 to 1, which may mask the problem.
This would separate the two paths and potentially expose the problem; if we see both still occur simultaneously, we have identified a general networking problem in the CI cluster.
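A minimal sketch of the /etc/hosts pinning half of Option 1, in Go. The hostnames, file handling, and function names here are illustrative assumptions, not the actual openshift-tests implementation.

```go
// Resolve each disruption backend hostname once up front and append the
// results to /etc/hosts, so the normal disruption samplers never issue
// DNS lookups for these hostnames again.
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

func pinHostnames(hosts []string) error {
	var b strings.Builder
	for _, h := range hosts {
		ips, err := net.LookupIP(h) // one-time lookup at monitor startup
		if err != nil {
			return fmt.Errorf("initial lookup of %s failed: %w", h, err)
		}
		for _, ip := range ips {
			b.WriteString(fmt.Sprintf("%s %s\n", ip.String(), h))
		}
	}
	// Append so existing /etc/hosts entries are preserved.
	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(b.String())
	return err
}

func main() {
	// Placeholder hostname; the real list would be the disruption backends.
	if err := pinHostnames([]string{"api.example.openshift.com"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```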
Option 2: Write unit test to correlate DNS disruption with real cluster disruption
If we see to-the-second overlap (possibly +/- 1 second) of real vs. DNS disruption, fail the test and report when the overlap hit (sketched below).
This would give us accurate reporting of how often the problem is occurring. It does not provide evidence one way or the other as to which is the real problem, DNS or full networking.
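A minimal sketch of the Option 2 overlap check, assuming a simple interval representation rather than the real monitor types, with the +/- 1 second tolerance applied as padding between intervals.

```go
// Correlate DNS-lookup disruption intervals with real (TCP) disruption
// intervals and report any pair that overlaps within a 1 second tolerance.
package main

import (
	"fmt"
	"time"
)

// Interval is an illustrative stand-in for a disruption interval.
type Interval struct {
	From, To time.Time
}

// overlapsWithin reports whether a and b overlap, or come within tol of
// overlapping (the "+/- 1 second" allowance).
func overlapsWithin(a, b Interval, tol time.Duration) bool {
	return a.From.Add(-tol).Before(b.To) && b.From.Add(-tol).Before(a.To)
}

// correlate returns a message for each DNS/real overlap; a non-empty result
// would fail the test and say when the overlap hit.
func correlate(dns, real []Interval) []string {
	var hits []string
	for _, d := range dns {
		for _, r := range real {
			if overlapsWithin(d, r, time.Second) {
				hits = append(hits, fmt.Sprintf(
					"DNS disruption %s..%s overlaps real disruption %s..%s",
					d.From.Format(time.RFC3339), d.To.Format(time.RFC3339),
					r.From.Format(time.RFC3339), r.To.Format(time.RFC3339)))
			}
		}
	}
	return hits
}
```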
Option 3: Duplicate all backends with a copy that tests against direct host IPs
The team mentioned that SNI means this should work: kube-api-new-ip-connections would then go straight to the IP we looked up at the start of the disruption monitoring process (see the sketch after this option).
This should expose in spyglass whether the problem is surfacing in the DNS path, the direct-to-IP path, or both at any given point in time.
This may require something like Option 2 as well, so we have a signal on how often the problem occurs and whether we're getting better.
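A minimal sketch of an Option 3 direct-to-IP sampler, assuming a hypothetical hostname and a pre-resolved IP: it dials the IP directly so DNS is never consulted, while TLS ServerName still presents the original hostname for SNI and certificate validation.

```go
// Build an HTTP client that connects straight to a pre-resolved IP,
// bypassing DNS, while keeping SNI pointed at the original hostname.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"time"
)

func newDirectIPClient(hostname, ip string) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			// Ignore the hostname-derived address; connect to the IP we
			// looked up at the start of disruption monitoring.
			_, port, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			return dialer.DialContext(ctx, network, net.JoinHostPort(ip, port))
		},
		TLSClientConfig: &tls.Config{
			ServerName: hostname, // SNI still carries the real hostname
		},
		DisableKeepAlives: true, // mirror "new connections" sampler behavior
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

func main() {
	// Placeholder hostname/IP; in practice these come from the initial lookup.
	client := newDirectIPClient("api.example.openshift.com", "203.0.113.10")
	resp, err := client.Get("https://api.example.openshift.com/healthz")
	if err != nil {
		fmt.Println("disruption sample failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```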
Option 4: Spin up VM in cloud provider to do disruption testing
This takes CoreDNS in the CI cluster out of the picture; we could then compare its results with those we get from the CI cluster and determine whether we're catching real disruption or not.
- is blocked by: TRT-856 Write test to detect overlap between DNS lookup and real disruption (Closed)
- links to