OpenShift Bugs / OCPBUGS-38660

Intra-cluster disruption on metal in rare scenarios



      In CI, we've detected a pattern in two runs that indicates a potential upgrade bug in which in-cluster networking is disrupted for about 25s.

      This is a borderline bug to file because we have only two runs where we've found this. Both are from the last week, and both are 4.17 minor upgrades on IPv6. I can't find it in 4.16 or 4.18.

      However, the pattern is similar in both runs and we suspect this could be surfacing a real bug, so I figured it should be filed.

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.17-upgrade-from-stable-4.16-e2e-metal-ipi-upgrade-ovn-ipv6/1825161042779443200

      • In this job it looks like master-0 is the culprit: during its upgrade, the other nodes lose connectivity to it.

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.17-upgrade-from-stable-4.16-e2e-metal-ipi-upgrade-ovn-ipv6/1824209286775967744

      • also worker-0

      If you expand the intervals on both, you'll see that during the node update there is a huge swath of disruption across many in-cluster backends, but only those that are "*-to-host", which implies a problem with the host network. All of the disruption appears to be "to" the same node, worker-0 in both cases, which is being upgraded/rebooted at that time.
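
To make the pattern concrete, here is a minimal Go sketch of the check described above: keep only the "*-to-host" backends and total the disruption per target node. The `Interval` type, its fields, the backend names, and the sample values are all hypothetical stand-ins, not the real interval schema or data from these jobs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Interval is a hypothetical, simplified disruption record. The real
// interval schema in the CI artifacts is not reproduced here.
type Interval struct {
	Backend    string // illustrative backend name, e.g. "pod-to-host"
	TargetNode string // node the traffic was sent to
	Seconds    int    // duration of the disruption
}

// hostNetworkDisruptionByNode sums disruption seconds per target node,
// counting only backends whose name ends in "-to-host".
func hostNetworkDisruptionByNode(intervals []Interval) map[string]int {
	totals := make(map[string]int)
	for _, iv := range intervals {
		if strings.HasSuffix(iv.Backend, "-to-host") {
			totals[iv.TargetNode] += iv.Seconds
		}
	}
	return totals
}

func main() {
	// Made-up sample values, for illustration only.
	intervals := []Interval{
		{Backend: "pod-to-host", TargetNode: "worker-0", Seconds: 3},
		{Backend: "host-to-host", TargetNode: "worker-0", Seconds: 5},
		{Backend: "pod-to-pod", TargetNode: "worker-1", Seconds: 4},
	}
	totals := hostNetworkDisruptionByNode(intervals)

	// Sort node names so the output order is stable.
	nodes := make([]string, 0, len(totals))
	for node := range totals {
		nodes = append(nodes, node)
	}
	sort.Strings(nodes)
	for _, node := range nodes {
		fmt.Printf("%s: %ds\n", node, totals[node])
	}
}
```

If every non-zero total lands on a single node that was rebooting at the time, that matches the pattern seen in the two runs.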

      The disruption monitoring framework is supposed to handle a node going down by monitoring disruption only for hosts that appear in an endpointslice of a service the poller creates. I'm unclear whether this is a product bug or a problem with the monitoring itself. If it is a product bug, it would imply that a node is being upgraded/rebooted without being removed from the endpoint slices.
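
The intended behaviour can be sketched roughly as follows. This is not the code from the poller; it is a simplified illustration that assumes a hypothetical `Endpoint` type loosely modeled on the node name, address, and ready condition that Kubernetes EndpointSlice entries carry. The addresses and node names are made up.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified stand-in for one entry in an
// EndpointSlice. It is not the real Kubernetes API type.
type Endpoint struct {
	NodeName string
	Address  string
	Ready    bool
}

// pollTargets returns the addresses that a poller following the rule
// "only monitor hosts present and ready in the endpoint slice" would probe.
func pollTargets(endpoints []Endpoint) []string {
	var targets []string
	for _, ep := range endpoints {
		if ep.Ready {
			targets = append(targets, ep.Address)
		}
	}
	return targets
}

func main() {
	// Illustrative data: worker-0 is rebooting and has been marked not ready.
	endpoints := []Endpoint{
		{NodeName: "worker-0", Address: "fd00::10", Ready: false},
		{NodeName: "worker-1", Address: "fd00::11", Ready: true},
		{NodeName: "master-0", Address: "fd00::20", Ready: true},
	}
	fmt.Println(pollTargets(endpoints))
}
```

Under this model, a rebooting node produces disruption samples only if it is still listed as ready while it is unreachable, which is the product-bug hypothesis above. The alternative is that the poller keeps probing a node that has already been removed.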

      The disruption poller code is in: https://github.com/openshift/origin/blob/master/pkg/cmd/openshift-tests/disruption/watch-endpointslice/watch_endpointslice_options.go and was written by deads.

              sdn-team-bot sdn-team bot
              rhn-engineering-dgoodwin Devan Goodwin
              Arti Sood Arti Sood