OpenShift Bugs / OCPBUGS-1130

Investigate etcdGRPCRequestsSlow test


    • Critical
    • ETCD Sprint 226
    • 1
    • Rejected
    • False

      The test results in sippy look really bad on our less common platforms, and they are still pretty unacceptable even on core clouds. It is fairly often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.

      As of Sep 13th:

      • several vsphere and openstack variant combos fail this test around 24-32% of the time
      • aws, amd64, ovn, upgrade, upgrade-micro, ha - fails 6% of the time
      • aws, amd64, ovn, upgrade, upgrade-minor, ha - fails 4% of the time
      • gcp, amd64, sdn, upgrade, upgrade-minor, ha - fails 8% of the time
      • globally across all jobs fails around 3% of the time.

      Even on some major variant combos, a 4-8% failure rate is too high.
      On the Sep 13 arch call (no etcd team present), Damien mentioned this might be an upstream alert that just isn't well suited to OpenShift's use cases. Is that the case, and does it need tuning?
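To make the comparison against the global rate concrete, here is a minimal sketch that ranks the named variant combos by how far they sit above the roughly 3% global failure rate. The figures are the approximate Sep 13 numbers quoted above; the function and variable names are hypothetical and not part of sippy or any OpenShift tooling. The vsphere and openstack combos are left out because the issue only gives a 24-32% range for them.

```python
# Approximate failure rates (percent) copied from the Sep 13 figures in this issue.
# All names here are illustrative; none of this is a sippy API.
GLOBAL_FAILURE_PCT = 3.0

failure_pct_by_variant = {
    "aws, amd64, ovn, upgrade, upgrade-micro, ha": 6.0,
    "aws, amd64, ovn, upgrade, upgrade-minor, ha": 4.0,
    "gcp, amd64, sdn, upgrade, upgrade-minor, ha": 8.0,
}


def variants_above(rates, baseline_pct):
    """Return (variant, rate, ratio to baseline) tuples, worst rate first."""
    flagged = [
        (variant, pct, pct / baseline_pct)
        for variant, pct in rates.items()
        if pct > baseline_pct
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for variant, pct, ratio in variants_above(failure_pct_by_variant, GLOBAL_FAILURE_PCT):
        print(f"{variant}: {pct:.0f}% ({ratio:.1f}x the global rate)")
```

With these inputs all three named combos land above the global rate, which is the point made above: even the major variant combos fail more often than the fleet-wide average.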

      Has the problem been getting worse?

      I believe this link, https://datastudio.google.com/s/urkKwmmzvgo, indicates that this may be the case for 4.12. AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11, where it looks consistent. gcp seems fine on 4.12. We do not have data for vsphere for some reason.

      This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      At a glance: LeaseGrant, MemberList, Txn, Status, Range.

      Broken out of TRT-401
      For linking with sippy:
      [bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
      [sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]

       

        1. screenshot-2.png (108 kB)
        2. screenshot-1.png (110 kB)
        3. image-2022-10-20-12-31-03-538.png (176 kB)
        4. image-2022-10-20-12-10-33-812.png (369 kB)
        5. image-2022-10-20-11-57-26-687.png (119 kB)
        6. image-2022-10-20-11-49-02-588.png (81 kB)
        7. image-2022-10-20-11-44-58-462.png (146 kB)
        8. image-2022-10-18-18-21-02-299.png (79 kB)
        9. image-2022-10-18-18-14-21-383.png (85 kB)
        10. image-2022-10-18-18-04-50-238.png (112 kB)
        11. image-2022-10-18-18-00-58-785.png (142 kB)
        12. image-2022-10-18-12-36-45-657.png (107 kB)
        13. image-2022-10-13-16-13-53-247.png (141 kB)
        14. image-2022-10-11-16-50-17-452.png (86 kB)

            Devan Goodwin (rhn-engineering-dgoodwin)
            Votes: 0
            Watchers: 14
