OCP Technical Release Team / TRT-849

Pinpoint possible CI cluster networking issues


Details

    • Type: Story
    • Resolution: Obsolete
    • Priority: Normal

    Description

      kenzhang@redhat.com identified a situation where we saw to-the-second simultaneous disruption against multiple backends: some reported a DNS lookup error (which we believe is a CI cluster problem), while others reported a more normal TCP I/O error, which we thought was real disruption to the cluster. The fact that both of these could be occurring independently at the same time is highly suspect, and we're wondering if there's a larger DNS issue, or a larger networking issue, at play.

      For an example, see the second spyglass chart at 8:18:53: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27694-ci-4.13-e2e-azure-ovn-upgrade/1623906725591519232

      Data must be gathered to prove this is occurring, how often, on which NURPs, on which build clusters, etc. We also need this data to detect whether the problem is improving or fixed.

      This test detects DNS problems in the CI cluster, and it continues to fail about 25% of the time: https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bsig-trt%5D%20no%20DNS%20lookup%20errors%20should%20be%20encountered%20in%20disruption%20samplers
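
      As context for how a sampler can tell the two failure modes apart, here is a minimal Go sketch that classifies a request error as a DNS lookup error versus a more ordinary TCP I/O error; the helper name and backend URL are hypothetical illustrations, not the actual openshift-tests code.

      {code:go}
      package main

      import (
          "errors"
          "fmt"
          "net"
          "net/http"
          "time"
      )

      // classifyError is a hypothetical helper: it reports whether a sampler error
      // looks like a DNS lookup failure (suspected CI cluster problem) or a more
      // ordinary TCP/connection error (suspected real disruption to the cluster).
      func classifyError(err error) string {
          var dnsErr *net.DNSError
          if errors.As(err, &dnsErr) {
              return "dns-lookup-error"
          }
          var opErr *net.OpError
          if errors.As(err, &opErr) {
              return "tcp-io-error"
          }
          return "other-error"
      }

      func main() {
          client := &http.Client{Timeout: 5 * time.Second}
          // Hypothetical backend URL; the real samplers poll their own endpoints.
          resp, err := client.Get("https://kube-api.example.test/healthz")
          if err != nil {
              fmt.Println("sample failed:", classifyError(err))
              return
          }
          resp.Body.Close()
          fmt.Println("sample succeeded:", resp.Status)
      }
      {code}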

      Options proposed on the Feb 13 scrum call:

      Option 1: Separate DNS from normal TCP disruption testing

      Have openshift-tests do an initial lookup of the hostnames and write them all to /etc/hosts, so normal disruption testing never does DNS lookups for these again (see the sketch after this option).

      Add actual DNS query disruption testing as backends. Hit multiple DNS servers, possibly with multiple backends for each, just to load up requests; we don't want to drop from 8 to 1, which might mask the problem.

      This would separate the two paths and potentially expose the problem; if we still see both occur simultaneously, we have identified a general networking problem in the CI cluster.
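
      A minimal sketch of the Option 1 pre-resolution step, assuming a list of backend hostnames is known up front; the hostnames, helper name, and direct append to /etc/hosts are assumptions for illustration, not the final design.

      {code:go}
      package main

      import (
          "fmt"
          "net"
          "os"
          "strings"
      )

      // pinHosts resolves each hostname once and returns /etc/hosts-style lines,
      // so later disruption sampling never performs another DNS lookup for them.
      func pinHosts(hostnames []string) (string, error) {
          var b strings.Builder
          for _, h := range hostnames {
              addrs, err := net.LookupHost(h)
              if err != nil || len(addrs) == 0 {
                  return "", fmt.Errorf("initial lookup of %s failed: %v", h, err)
              }
              b.WriteString(fmt.Sprintf("%s %s\n", addrs[0], h))
          }
          return b.String(), nil
      }

      func main() {
          // Hypothetical backend hostnames.
          entries, err := pinHosts([]string{"api.example-cluster.test", "console.example-cluster.test"})
          if err != nil {
              panic(err)
          }
          // Append the pinned entries to /etc/hosts (needs write access in the sampler's environment).
          f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
          if err != nil {
              panic(err)
          }
          defer f.Close()
          if _, err := f.WriteString(entries); err != nil {
              panic(err)
          }
      }
      {code}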

      Option 2: Write unit test to correlate DNS disruption with real cluster disruption

      If we see to-the-second overlap (possibly +/- 1 second) of real vs. DNS disruption, fail the test and report when the overlap hit (see the sketch after this option).

      This would then give us accurate reporting of how often the problem is occurring. It does not provide evidence one way or the other about which is the real problem, DNS or full networking.
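
      A minimal sketch of the Option 2 correlation check, assuming disruption events are available as intervals tagged with a backend name and an error class; the types, tolerance, and sample data are assumptions for illustration.

      {code:go}
      package main

      import (
          "fmt"
          "time"
      )

      // interval is a hypothetical disruption record: when it started and ended,
      // and whether the sampler classified it as a DNS failure or a real TCP failure.
      type interval struct {
          backend    string
          isDNSError bool
          start, end time.Time
      }

      // overlaps reports whether two intervals overlap within a +/- tolerance.
      func overlaps(a, b interval, tol time.Duration) bool {
          return a.start.Before(b.end.Add(tol)) && b.start.Before(a.end.Add(tol))
      }

      // findSuspectOverlaps returns pairs where a DNS-error disruption overlaps a
      // TCP-error disruption to the second; a test could fail whenever any are found.
      func findSuspectOverlaps(events []interval, tol time.Duration) []string {
          var hits []string
          for i, a := range events {
              for _, b := range events[i+1:] {
                  if a.isDNSError != b.isDNSError && overlaps(a, b, tol) {
                      hits = append(hits, fmt.Sprintf("%s overlaps %s around %s",
                          a.backend, b.backend, a.start.Format(time.RFC3339)))
                  }
              }
          }
          return hits
      }

      func main() {
          // Hypothetical sample data around the 8:18:53 example above.
          t0 := time.Date(2023, 2, 9, 8, 18, 53, 0, time.UTC)
          events := []interval{
              {"kube-api", true, t0, t0.Add(2 * time.Second)},
              {"oauth", false, t0.Add(time.Second), t0.Add(3 * time.Second)},
          }
          for _, hit := range findSuspectOverlaps(events, time.Second) {
              fmt.Println(hit)
          }
      }
      {code}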

      Option 3: Duplicate all backends with a copy that tests against direct host IPs

      The team mentioned that SNI means this should work. kube-api-new-ip-connections would then go straight to the IP we looked up at the start of the disruption monitoring process (see the sketch after this option).

      This should expose in spyglass whether the problem is surfacing in the DNS path, the direct-to-IP path, or both at any given point in time.

      This may also require something like Option 2 to give us a signal on how often the problem occurs and whether we're getting better.
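
      A minimal sketch of the Option 3 direct-to-IP path, assuming the backend's TLS certificate is valid for the original hostname; the hostname, IP, and port are hypothetical. The request goes to the pre-resolved IP while SNI (and the Host header) still carry the hostname, so no DNS lookup happens during sampling.

      {code:go}
      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "time"
      )

      func main() {
          const hostname = "kube-api.example-cluster.test" // hypothetical backend hostname
          const resolvedIP = "203.0.113.10"                // looked up once when monitoring starts

          // Keep SNI pointed at the hostname so TLS verification still works
          // even though the connection is dialed straight to the IP.
          client := &http.Client{
              Timeout: 5 * time.Second,
              Transport: &http.Transport{
                  TLSClientConfig: &tls.Config{ServerName: hostname},
              },
          }

          req, err := http.NewRequest("GET", fmt.Sprintf("https://%s:6443/healthz", resolvedIP), nil)
          if err != nil {
              panic(err)
          }
          req.Host = hostname

          resp, err := client.Do(req)
          if err != nil {
              fmt.Println("direct-IP path error:", err)
              return
          }
          defer resp.Body.Close()
          fmt.Println("direct-IP path status:", resp.Status)
      }
      {code}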

      Option 4: Spin up VM in cloud provider to do disruption testing

      This takes CoreDNS in the CI cluster out of the picture; we could then compare its results with those we got from the CI cluster and determine whether we're catching real disruption or not.

    People

      Assignee: Unassigned
      Reporter: Devan Goodwin (rhn-engineering-dgoodwin)