OpenShift Bugs / OCPBUGS-27101

[regression] increased etcd leader elections significantly impacting vsphere amd64 platform

Details

    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.15.0
    • 4.16.0
    • Etcd
    • Critical
    • No
    • Proposed
    • False

    Description

      This is a clone of issue OCPBUGS-27094. The following is the description of the original issue:

      Description of problem:

      Based on two Component Readiness reports that compare success rates for two particular tests, the current 4.15 master is regressing by roughly 7-10% relative to 4.14.z (in other words, we made the product about 10% worse).

       

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1720630313664647168

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1719915053026643968

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721475601161785344

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1724202075631390720

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721927613917696000

      The failures in these jobs are all caused by increased etcd leader elections, which disrupt seemingly unrelated test cases across the vSphere AMD64 platform.

      Since this particular platform's business significance is high, I'm setting this as "Critical" severity.

      Please get in touch with me or dwest@redhat.com if more teams need to be pulled into investigation and mitigation.

       

      Version-Release number of selected component (if applicable):

      4.15 / master

      How reproducible:

      Component Readiness Board

      Actual results:

      etcd leader elections are elevated. Some jobs indicate that the cause is either disk I/O throughput or network overload.

      Expected results:

      1. We NEED to understand what is causing this problem.
      2. If we can mitigate this, we should.
      3. If we cannot mitigate this, we need to document it or work with the vSphere infrastructure provider to fix the problem.
      4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.
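
As a rough illustration of item 4, one way to measure how often this happens is to track the increase of etcd's `etcd_server_leader_changes_seen_total` counter per cluster over a fixed window. The sketch below is not taken from this bug. It assumes samples of that counter have already been collected as (timestamp, value) pairs from one etcd member per cluster. The names `counter_increase` and `noisy_clusters`, the cluster names, and the threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

# One sample of etcd_server_leader_changes_seen_total: (unix timestamp, counter value).
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample]) -> float:
    """Total increase of a counter over the sampled window, tolerating resets."""
    total = 0.0
    ordered = sorted(samples)  # order by timestamp
    for (_, prev), (_, cur) in zip(ordered, ordered[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the etcd process restarted and the counter began
            # again from zero, so the whole current value is new increase.
            total += cur
    return total


def noisy_clusters(fleet: Dict[str, List[Sample]], max_changes: float) -> Dict[str, float]:
    """Return clusters whose leader-change count in the window exceeds max_changes.

    max_changes is a hypothetical threshold; pick it from fleet baselines.
    """
    increases = {name: counter_increase(s) for name, s in fleet.items()}
    return {name: inc for name, inc in increases.items() if inc > max_changes}


if __name__ == "__main__":
    # Made-up sample data for two hypothetical clusters.
    fleet = {
        "cluster-a": [(0, 2), (300, 2), (600, 3)],
        "cluster-b": [(0, 5), (300, 9), (600, 1), (900, 4)],
    }
    print(noisy_clusters(fleet, max_changes=3))
```

Counting increases rather than raw values keeps the comparison meaningful when etcd members restart during a run. Correlating the flagged clusters with disk and network latency metrics would be a separate step.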

      Additional info:

       

            People

              dwest@redhat.com Dean West
              openshift-crt-jira-prow OpenShift Prow Bot
              Ke Wang Ke Wang