OpenShift Bugs / OCPBUGS-1130

Investigate etcdGRPCRequestsSlow test

Details

    • Critical
    • ETCD Sprint 226
    • Rejected

    Description

      In Sippy, the results for this test look very bad on our less common platforms and are still unacceptable on the core clouds. It is quite often the only test that fails in a run. We need to decide what to do here, and we will need input from the etcd team.

      As of Sep 13th:

      • several vsphere and openstack variant combos fail this test around 24-32% of the time
      • aws, amd64, ovn, upgrade, upgrade-micro, ha - fails 6% of the time
      • aws, amd64, ovn, upgrade, upgrade-minor, ha - fails 4% of the time
      • gcp, amd64, sdn, upgrade, upgrade-minor, ha - fails 8% of the time
      • globally across all jobs fails around 3% of the time.
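To put the figures above on a common scale, each rate can be divided by the global rate. The snippet below is only a rough sketch using the numbers quoted in this description. The names `GLOBAL_RATE`, `rates`, `relative_to_global`, and `summarize` are made up for illustration and are not part of Sippy.

```python
# Failure rates (percent) copied from the list above, as of Sep 13.
GLOBAL_RATE = 3.0  # "globally across all jobs fails around 3% of the time"

rates = {
    "aws, amd64, ovn, upgrade, upgrade-micro, ha": 6.0,
    "aws, amd64, ovn, upgrade, upgrade-minor, ha": 4.0,
    "gcp, amd64, sdn, upgrade, upgrade-minor, ha": 8.0,
}

# vsphere/openstack combos are only given as a range, so keep both ends.
VSPHERE_OPENSTACK_RANGE = (24.0, 32.0)


def relative_to_global(rate, global_rate=GLOBAL_RATE):
    """Return how many times higher a failure rate is than the global rate."""
    return rate / global_rate


def summarize(rates):
    """Return (variant, rate, ratio) tuples sorted by rate, worst first."""
    rows = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, rate, round(relative_to_global(rate), 2)) for name, rate in rows]


for name, rate, ratio in summarize(rates):
    print(f"{rate:4.1f}%  {ratio:.2f}x global  {name}")

low, high = (round(relative_to_global(r), 2) for r in VSPHERE_OPENSTACK_RANGE)
print(f"vsphere/openstack combos: {low:.2f}x to {high:.2f}x global")
```

On these numbers the listed core-cloud combos sit between roughly 1.3x and 2.7x the global rate, while the vsphere and openstack combos are at 8x or more.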

      Even on some major variant combos, a 4-8% failure rate is too high.
      On the Sep 13 arch call (no one from etcd was present), Damien mentioned that this might be an upstream alert that just isn't well suited to OpenShift's use cases. Is that the case, and does it need tuning?

      Has the problem been getting worse?

      I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12: AWS and Azure are both getting worse in ways that I don't see when I change the release to 4.11, where the results look consistent. GCP seems fine on 4.12. For some reason we do not have data for vSphere.

      This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      At a glance: LeaseGrant, MemberList, Txn, Status, Range.
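For anyone who wants to rerun that search with a different window, the query string can be rebuilt from its parts. This is a minimal sketch: the parameter names and values are copied from the link above, the helper name `build_search_url` is hypothetical, and it is an assumption that the search page accepts other values (for example a longer `maxAge`).

```python
from urllib.parse import urlencode

BASE = "https://search.ci.openshift.org/"


def build_search_url(search, max_age="48h", context=7, result_type="junit",
                     max_matches=5, max_bytes=20971520, group_by="job"):
    """Rebuild the CI search link; defaults mirror the link in this issue."""
    params = {
        "search": search,
        "maxAge": max_age,
        "context": context,
        "type": result_type,
        "name": "",          # left empty in the original link
        "excludeName": "",   # left empty in the original link
        "maxMatches": max_matches,
        "maxBytes": max_bytes,
        "groupBy": group_by,
    }
    # urlencode uses quote_plus, so spaces become "+" as in the original link.
    return BASE + "?" + urlencode(params)


url = build_search_url("etcdGRPCRequestsSlow was at or above")
print(url)
```

With the defaults this reproduces the link above character for character. Passing a different `max_age` changes only that one parameter.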

      Broken out of TRT-401
      For linking with sippy:
      [bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
      [sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]

       

      Attachments

        1. image-2022-10-11-16-50-17-452.png (86 kB)
        2. image-2022-10-13-16-13-53-247.png (141 kB)
        3. image-2022-10-18-12-36-45-657.png (107 kB)
        4. image-2022-10-18-18-00-58-785.png (142 kB)
        5. image-2022-10-18-18-04-50-238.png (112 kB)
        6. image-2022-10-18-18-14-21-383.png (85 kB)
        7. image-2022-10-18-18-21-02-299.png (79 kB)
        8. image-2022-10-20-11-44-58-462.png (146 kB)
        9. image-2022-10-20-11-49-02-588.png (81 kB)
        10. image-2022-10-20-11-57-26-687.png (119 kB)
        11. image-2022-10-20-12-10-33-812.png (369 kB)
        12. image-2022-10-20-12-31-03-538.png (176 kB)
        13. screenshot-1.png (110 kB)
        14. screenshot-2.png (108 kB)

            People

              rhn-engineering-dgoodwin Devan Goodwin