- Bug
- Resolution: Unresolved
- Normal
- None
- 4.17.0
- Low
- None
- False
In CI, we've detected a pattern in two runs that indicates a potential bug with upgrade where in-cluster networking is affected for about 25s.
This is a borderline bug to file because we have only two runs where we've found this, both within the last week, on 4.17 minor upgrades with IPv6. I can't find it in 4.16 or 4.18.
However, the pattern is similar in both runs and we suspect this could be surfacing a real bug, so I figured it should be filed.
- in this job it looks like master-0 is the culprit: during its upgrade, the other nodes lose connectivity to it.
- also worker-0
If you expand the intervals on both, you'll see a huge swath of disruption during the node update on lots of in-cluster backends, but only those that are "*-to-host", implying a problem with the host network. All of the disruption seems to be "to" the same node, worker-0 in both cases, which is being upgraded/rebooted at that time.
The disruption monitoring framework is supposed to handle a node going down by monitoring disruption only for hosts listed in an EndpointSlice on a service the poller creates. I'm unclear whether this is a product bug or a problem with the monitoring itself. If it is a product bug, it would seem to imply that a node being upgraded/rebooted is not being removed from its EndpointSlices.
The disruption poller code is in: https://github.com/openshift/origin/blob/master/pkg/cmd/openshift-tests/disruption/watch-endpointslice/watch_endpointslice_options.go and was written by deads.
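For illustration only, here's a minimal sketch of the approach described above, not the actual origin poller linked here; the namespace, service name, port, and /healthz path are assumptions. The idea is to list the EndpointSlices backing a poller-created service and only probe addresses currently marked Ready, so a node that is correctly removed from the slices stops being polled.

```go
// Minimal sketch (assumed names, not the real origin poller): poll only the
// addresses that the service's EndpointSlices currently report as Ready.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	namespace := "e2e-disruption" // assumed namespace
	service := "disruption-svc"   // assumed service name

	for {
		// List only the EndpointSlices owned by the poller's service; hosts not
		// listed here (e.g. a drained/removed node) should not be probed at all.
		slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(context.TODO(), metav1.ListOptions{
			LabelSelector: discoveryv1.LabelServiceName + "=" + service,
		})
		if err != nil {
			fmt.Println("list error:", err)
			time.Sleep(time.Second)
			continue
		}

		for _, slice := range slices.Items {
			for _, ep := range slice.Endpoints {
				// Skip endpoints not reporting Ready; an upgrading/rebooting node
				// should drop out here if the product behaves as expected.
				if ep.Conditions.Ready == nil || !*ep.Conditions.Ready {
					continue
				}
				for _, addr := range ep.Addresses {
					probe(addr)
				}
			}
		}
		time.Sleep(time.Second)
	}
}

// probe checks a single backend host; a failure here while the host is still
// listed as Ready would be recorded as "*-to-host" disruption.
func probe(addr string) {
	// JoinHostPort brackets IPv6 literals, which matters for these IPv6 runs.
	url := fmt.Sprintf("http://%s/healthz", net.JoinHostPort(addr, "8080"))
	c := &http.Client{Timeout: time.Second}
	resp, err := c.Get(url)
	if err != nil {
		fmt.Printf("disruption: %s unreachable: %v\n", addr, err)
		return
	}
	resp.Body.Close()
	fmt.Printf("ok: %s (%d)\n", addr, resp.StatusCode)
}
```

Under that model, an upgrading node should disappear from the Ready endpoints before its host networking goes away; disruption recorded against an address that is still listed would point at the product rather than at the monitoring.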