OpenShift Bugs / OCPBUGS-31492

Subset of Metal jobs have insufficient etcd disk I/O


Details


    Description

      Component Readiness has found a potential regression in [sig-arch][Early] CRDs for openshift.io should have subresource.status [Suite:openshift/conformance/parallel].

      Probability of significant regression: 98.48%

      Sample (being evaluated) Release: 4.16
      Start Time: 2024-03-21T00:00:00Z
      End Time: 2024-03-27T23:59:59Z
      Success Rate: 89.29%
      Successes: 25
      Failures: 3
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-02-01T00:00:00Z
      End Time: 2024-02-28T23:59:59Z
      Success Rate: 99.28%
      Successes: 138
      Failures: 1
      Flakes: 0

      View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20no-upgrade%20amd64%20metal-ipi%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-03-27%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-03-21%2000%3A00%3A00&testId=openshift-tests%3Ab3e170673c14c432c14836e9f41e7285&testName=%5Bsig-arch%5D%5BEarly%5D%20CRDs%20for%20openshift.io%20should%20have%20subresource.status%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial
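
      For reference, the "probability of significant regression" comes from statistically comparing the sample and base success/failure counts above. The following is a minimal sketch, in Go, of one such comparison (a one-sided Fisher's exact test over this report's counts); Component Readiness' actual computation in Sippy may differ in its details, so treat the printed confidence as illustrative rather than an exact reproduction of the 98.48% figure.

```go
package main

import (
    "fmt"
    "math"
)

// logChoose returns log(n choose k) via log-gamma to avoid overflow.
func logChoose(n, k int) float64 {
    a, _ := math.Lgamma(float64(n + 1))
    b, _ := math.Lgamma(float64(k + 1))
    c, _ := math.Lgamma(float64(n - k + 1))
    return a - b - c
}

// hypergeomPMF is P(X = k) when drawing n runs from a population of N runs
// that contains K failures in total.
func hypergeomPMF(k, K, n, N int) float64 {
    return math.Exp(logChoose(K, k) + logChoose(N-K, n-k) - logChoose(N, n))
}

func main() {
    // 2x2 table from the report: sample = 4.16, base = 4.15.
    sampleSuccess, sampleFail := 25, 3
    baseSuccess, baseFail := 138, 1

    n := sampleSuccess + sampleFail // sample total
    K := sampleFail + baseFail      // total failures across both releases
    N := n + baseSuccess + baseFail // grand total of runs

    // One-sided Fisher's exact test: probability of seeing at least this
    // many failures in the sample if both releases shared one failure rate.
    p := 0.0
    for k := sampleFail; k <= K && k <= n; k++ {
        p += hypergeomPMF(k, K, n, N)
    }

    fmt.Printf("sample pass rate: %.2f%%\n", 100*float64(sampleSuccess)/float64(n))
    fmt.Printf("base pass rate:   %.2f%%\n", 100*float64(baseSuccess)/float64(baseSuccess+baseFail))
    fmt.Printf("one-sided p-value: %.4f (regression confidence ~%.2f%%)\n", p, 100*(1-p))
}
```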

      In examining these test failures we found what is actually a fairly random grouping of failing tests; as a group, they are likely a significant part of why Component Readiness is reporting so much red on metal right now.

      In March the metal team modified some configuration such that a portion of metal jobs can now land in a couple of new environments, one of them ibmcloud.

      The test linked above helped surface the pattern: opening the spyglass chart in Prow shows a clear signature that we then found in many other failed metal jobs:

      • pod-logs section full of etcd logging problems where reads and writes took too long
      • a vertical line of disruption across multiple backends
      • an abnormal vertical line of etcd leader elections jumping around
      • a vertical line of failed e2e tests

      All of these line up within the same vertical window, indicating the problems occurred at the same time, and the pod-logs section is saturated with these etcd warnings.
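
      For anyone trying to confirm the first bullet without clicking through spyglass, a rough Go sketch of scanning downloaded pod logs for the etcd slow read/write warnings follows. The artifact path and the marker substrings here are illustrative assumptions, not the exact layout or log format of the CI artifacts.

```go
package main

import (
    "bufio"
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    // Root of extracted CI artifacts; adjust to wherever the job's pod-logs
    // were downloaded (hypothetical path).
    root := "./artifacts/pod-logs"

    // Substrings seen in etcd warnings about slow reads/writes; treat these
    // as examples, not an exhaustive or authoritative list.
    markers := []string{"took too long", "slow fdatasync"}

    counts := map[string]int{}
    err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() || !strings.Contains(path, "etcd") {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // etcd JSON log lines can be long
        for scanner.Scan() {
            line := scanner.Text()
            for _, m := range markers {
                if strings.Contains(line, m) {
                    counts[m]++
                }
            }
        }
        return scanner.Err()
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "scan failed:", err)
        os.Exit(1)
    }
    for m, n := range counts {
        fmt.Printf("%-20s %d\n", m, n)
    }
}
```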

      dhiggins@redhat.com has pulled ibmcloud out of rotation until they can try provisioning SSDs for etcd.

      This bug tracks the introduction of a test that will surface this symptom of severely unhealthy etcd as a test failure, both to communicate this critical failure to engineers looking at the runs, and to help us locate affected runs, because no single existing test can really do this today.

      Azure and GCP jobs normally log these etcd warnings 3-5k times in a CI run; these ibmcloud runs were showing 30-70k. A limit of 10k was chosen based on examining the data in BigQuery: only 50 jobs have exceeded that threshold this month, all of them metal and agent jobs.
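
      A minimal sketch of the kind of check this bug proposes, assuming the run's etcd log lines have already been gathered; the package, function name, and marker strings are hypothetical, but the 10k threshold matches the figure above.

```go
package monitor

import (
    "fmt"
    "strings"
)

// Substrings treated as etcd "overloaded" warnings; illustrative, not the
// authoritative set the real test will use.
var etcdOverloadMarkers = []string{"took too long", "slow fdatasync"}

// maxEtcdOverloadWarnings is the failure threshold suggested by the BigQuery
// data: healthy Azure/GCP runs log roughly 3-5k of these warnings, while the
// bad ibmcloud metal runs logged 30-70k.
const maxEtcdOverloadWarnings = 10000

// CheckEtcdOverload counts slow read/write warnings in the collected etcd
// log lines and returns an error (i.e. a test failure) when the count
// exceeds the threshold, making "etcd is very unhealthy" visible as a
// single, explicit signal in the run.
func CheckEtcdOverload(etcdLogLines []string) error {
    count := 0
    for _, line := range etcdLogLines {
        for _, m := range etcdOverloadMarkers {
            if strings.Contains(line, m) {
                count++
                break
            }
        }
    }
    if count > maxEtcdOverloadWarnings {
        return fmt.Errorf("etcd logged %d slow read/write warnings (threshold %d): disk I/O is likely insufficient", count, maxEtcdOverloadWarnings)
    }
    return nil
}
```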

          People

            Assignee: rhn-engineering-dgoodwin Devan Goodwin
            Reporter: rhn-engineering-dgoodwin Devan Goodwin