OpenShift Bugs / OCPBUGS-18549

Significant 12 minute pod-to-host disruption detected on aws ovn minor upgrades


Details

    • Important
    • No
    • Approved
    • False
    • Cause: A subnet mask was added to the values of the "k8s.ovn.org/host-addresses" annotation.

      Consequence: As nodes updated, they would rewrite the "k8s.ovn.org/host-addresses" annotation and remove the subnet mask, which caused issues with the master pods until they were also updated.

      Fix: An additional annotation, "k8s.ovn.org/host-cidrs", was created. During the upgrade the nodes leave the "k8s.ovn.org/host-addresses" annotation alone, and once the masters are fully upgraded they recognize the new annotation as including the subnet mask (see the comparison sketch after these fields).

      Result: Insignificant disruption during upgrades.
    • Bug Fix
    • In Progress
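
      For reference, below is a minimal sketch (not taken from the ovn-kubernetes code) that lists both annotations side by side for every node, so the pre-fix and post-fix formats can be compared during an upgrade. The annotation keys are the ones named in the doc text above; the client-go usage and the example values in the comments are illustrative assumptions.

      // Minimal sketch: print the old and new OVN host-address annotations for
      // every node. Assumes a kubeconfig at the default location.
      package main

      import (
          "context"
          "fmt"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
          "k8s.io/client-go/util/homedir"
      )

      func main() {
          kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
          config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              panic(err)
          }

          nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, node := range nodes.Items {
              // Pre-fix format: bare IPs, e.g. ["10.0.0.5"] (illustrative value)
              hostAddresses := node.Annotations["k8s.ovn.org/host-addresses"]
              // Post-fix format: IPs with subnet masks, e.g. ["10.0.0.5/24"] (illustrative value)
              hostCIDRs := node.Annotations["k8s.ovn.org/host-cidrs"]
              fmt.Printf("%s\n  host-addresses: %s\n  host-cidrs:     %s\n",
                  node.Name, hostAddresses, hostCIDRs)
          }
      }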

    Description

      DISCLAIMER: The code for measuring in-cluster disruption is extremely new, so we cannot be 100% confident that what we're seeing is real. However, the bug below shows a problem that occurs only in one very specific configuration, with all others unaffected, which gives us some confidence that the signal is real.

      https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=aws&var-percentile=P50&var-backend=pod-to-host-new-connections&var-releases=4.14&var-upgrade_type=minor&var-networks=sdn&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=1&var-master_node_updated=Y&from=now-7d&to=now

      • affects pod-to-host-new-connections
      • affects aws minor upgrades, which are seeing over 14000s of disruption at the P50
      • does not affect pod-to-host-reused-connections
      • does not affect any other clouds
      • does not affect micro upgrades
      • does not affect pod-to-service or pod-to-pod backends
      • does not affect sdn

      The total disruption is summed across a number of pods, so the actual duration is roughly the total divided by 14. The actual disruption appears to be about 12 minutes and hits all pods doing pod-to-host monitoring simultaneously.
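
      As a back-of-the-envelope check (illustrative only, not part of any tooling), the sketch below divides the sample-job disruption window back out against the ~14 pod-to-host monitors; the timestamps come from the sample job referenced below, and the pod count is the rough figure stated above.

      package main

      import (
          "fmt"
          "time"
      )

      func main() {
          const monitoringPods = 14 // rough number of pods doing pod-to-host monitoring

          // Disruption window observed in the sample job: 7:28:19 - 7:40:03.
          start, _ := time.Parse("15:04:05", "07:28:19")
          end, _ := time.Parse("15:04:05", "07:40:03")
          perPod := end.Sub(start)

          aggregate := perPod * monitoringPods
          fmt.Printf("per-pod outage:   %v\n", perPod) // ~11m44s, i.e. about 12 minutes
          fmt.Printf("summed over pods: %.0fs\n", aggregate.Seconds())
          // ~9,856s, the same order of magnitude as the >14,000s P50 aggregate on the dashboard.
      }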

      Sample job (taken from expanding the "Most Recent Runs" panel in Grafana):

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698740856976052224

      In the first spyglass chart for the upgrade, you can see the batch of disruption from 7:28:19 to 7:40:03.

      We do not have data from before OVN interconnect landed, so we cannot say whether this started at that time.


    People

      jtanenba@redhat.com Jacob Tanenbaum
      rhn-engineering-dgoodwin Devan Goodwin
      Anurag Saxena Anurag Saxena
      Votes: 0
      Watchers: 12
