- Bug
- Resolution: Unresolved
- Normal
- None
- 4.17.0
- Low
- None
- False
In CI, we've detected a pattern in two runs that indicates a potential bug with upgrade where in-cluster networking is affected for about 25s.
This is a borderline bug to file because we have only two runs where we've found this, both within the last week, on 4.17 minor upgrades with IPv6. I can't find it in 4.16 or 4.18.
However, the pattern is similar in both runs and we suspect this could be surfacing a real bug, so I figured it should be filed.
- in this job it looks like master-0 is the culprit: during its upgrade, the other nodes lose connectivity to it.
- also worker-0
If you expand the intervals on both, you'll see a huge swath of disruption during the node update on lots of in-cluster backends, but only those that are "*-to-host", implying a problem with the host network. All of the disruption seems to be "to" the same node, worker-0 in both cases, which is being upgraded/rebooted at that time.
The disruption monitoring framework is supposed to handle a node going down by monitoring disruption only for hosts listed in an EndpointSlice on a service the poller creates. I'm unclear whether this is a product bug or a problem with the monitoring itself. If it is a product bug, it would seem to imply that a node being upgraded/rebooted is not being removed from its EndpointSlices.
The disruption poller code is in: https://github.com/openshift/origin/blob/master/pkg/cmd/openshift-tests/disruption/watch-endpointslice/watch_endpointslice_options.go and was written by deads.
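For illustration only, here's a minimal sketch of the approach described above, not the actual origin poller linked here; the namespace, service name, port, and /healthz path are assumptions. The idea is to list the EndpointSlices backing a poller-created service and only probe addresses currently marked Ready, so a node that is correctly removed from the slices stops being polled.

```go
// Minimal sketch (assumed names, not the real origin poller): poll only the
// addresses that the service's EndpointSlices currently report as Ready.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	namespace := "e2e-disruption" // assumed namespace
	service := "disruption-svc"   // assumed service name

	for {
		// List only the EndpointSlices owned by the poller's service; hosts not
		// listed here (e.g. a drained/removed node) should not be probed at all.
		slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(context.TODO(), metav1.ListOptions{
			LabelSelector: discoveryv1.LabelServiceName + "=" + service,
		})
		if err != nil {
			fmt.Println("list error:", err)
			time.Sleep(time.Second)
			continue
		}

		for _, slice := range slices.Items {
			for _, ep := range slice.Endpoints {
				// Skip endpoints not reporting Ready; an upgrading/rebooting node
				// should drop out here if the product behaves as expected.
				if ep.Conditions.Ready == nil || !*ep.Conditions.Ready {
					continue
				}
				for _, addr := range ep.Addresses {
					probe(addr)
				}
			}
		}
		time.Sleep(time.Second)
	}
}

// probe checks a single backend host; a failure here while the host is still
// listed as Ready would be recorded as "*-to-host" disruption.
func probe(addr string) {
	// JoinHostPort brackets IPv6 literals, which matters for these IPv6 runs.
	url := fmt.Sprintf("http://%s/healthz", net.JoinHostPort(addr, "8080"))
	c := &http.Client{Timeout: time.Second}
	resp, err := c.Get(url)
	if err != nil {
		fmt.Printf("disruption: %s unreachable: %v\n", addr, err)
		return
	}
	resp.Body.Close()
	fmt.Printf("ok: %s (%d)\n", addr, resp.StatusCode)
}
```

Under that model, an upgrading node should disappear from the Ready endpoints before its host networking goes away; disruption recorded against an address that is still listed would point at the product rather than at the monitoring.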