Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.12.0
Affects Version/s: 4.12
Component/s: Etcd, Test Framework
Labels:

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Critical
Regression: None
Target Backport Versions: None
Target Version: 4.12.z
Release Blocker: Rejected
Sprint: ETCD Sprint 226
sprint_count: 1
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Release Note Status: None
Release Note Type: None
Release Note Text: None
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

The results for this test in Sippy look very bad on our less common platforms, and they are still unacceptable on the core clouds. It is fairly often the only test that fails. We need to decide what to do here, and we will need input from the etcd team.

As of Sep 13th:

Several vSphere and OpenStack variant combos fail this test around 24-32% of the time.
aws, amd64, ovn, upgrade, upgrade-micro, ha: fails 6% of the time.
aws, amd64, ovn, upgrade, upgrade-minor, ha: fails 4% of the time.
gcp, amd64, sdn, upgrade, upgrade-minor, ha: fails 8% of the time.
Globally, across all jobs, the test fails around 3% of the time.

Even on some major variant combos, a 4-8% failure rate is too high.
On the Sep 13 arch call (nobody from etcd present), Damien mentioned this might be an upstream alert that just isn't well suited to OpenShift's use cases. Is that the case, and does the alert need tuning?

Has the problem been getting worse?

I believe this link, https://datastudio.google.com/s/urkKwmmzvgo, indicates that this may be the case for 4.12. AWS and Azure are both getting worse in ways that I don't see when I change the release to 4.11, where the results look consistent. GCP seems fine on 4.12. We do not have data for vSphere for some reason.

This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

At a glance: LeaseGrant, MemberList, Txn, Status, Range.
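
For anyone adapting the search link above (for example, widening the time window), here is a minimal Python sketch that breaks the URL into its query parameters. The URL is copied verbatim from this ticket. The helper name `describe_query` is hypothetical, and the sketch only parses the link; it does not contact search.ci or assume anything about how the service interprets each parameter.

```python
from urllib.parse import urlsplit, parse_qs

# Search link copied verbatim from this ticket.
SEARCH_URL = (
    "https://search.ci.openshift.org/"
    "?search=etcdGRPCRequestsSlow+was+at+or+above"
    "&maxAge=48h&context=7&type=junit&name=&excludeName="
    "&maxMatches=5&maxBytes=20971520&groupBy=job"
)


def describe_query(url):
    """Return the query parameters of `url` as a flat dict.

    Hypothetical helper for illustration; blank values (such as `name=`)
    are kept so that the full set of parameters stays visible.
    """
    query = urlsplit(url).query
    parsed = parse_qs(query, keep_blank_values=True)
    # Every parameter appears once in this link, so take the first value.
    return {key: values[0] for key, values in parsed.items()}


params = describe_query(SEARCH_URL)
for key, value in params.items():
    print(f"{key} = {value!r}")
```

The decoded `search` value is the junit failure text being matched, and `maxAge=48h` is the value to edit in the link when a longer or shorter window is wanted.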

Broken out of ~~TRT-401~~
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]

Attachments:
image-2022-10-11-16-50-17-452.png (86 kB, 2022/10/11 2:50 PM)
image-2022-10-13-16-13-53-247.png (141 kB, 2022/10/13 2:13 PM)
image-2022-10-18-12-36-45-657.png (107 kB, 2022/10/18 10:36 AM)
image-2022-10-18-18-00-58-785.png (142 kB, 2022/10/18 4:00 PM)
image-2022-10-18-18-04-50-238.png (112 kB, 2022/10/18 4:04 PM)
image-2022-10-18-18-14-21-383.png (85 kB, 2022/10/18 4:14 PM)
image-2022-10-18-18-21-02-299.png (79 kB, 2022/10/18 4:21 PM)
image-2022-10-20-11-44-58-462.png (146 kB, 2022/10/20 9:44 AM)
image-2022-10-20-11-49-02-588.png (81 kB, 2022/10/20 9:49 AM)
image-2022-10-20-11-57-26-687.png (119 kB, 2022/10/20 9:57 AM)
image-2022-10-20-12-10-33-812.png (369 kB, 2022/10/20 10:10 AM)
image-2022-10-20-12-31-03-538.png (176 kB, 2022/10/20 10:31 AM)
screenshot-1.png (110 kB, 2022/09/20 3:04 PM)
screenshot-2.png (108 kB, 2022/09/21 1:10 PM)

blocks: OCPBUGS-1607 [4.11] Investigate etcdGRPCRequestsSlow test (Closed)

is blocked by: OCPBUGS-2604 Increase ControlPlane disk IOPS on GCP (Closed)

is cloned by: OCPBUGS-1607 [4.11] Investigate etcdGRPCRequestsSlow test (Closed)

is related to: OCPBUGS-1121 The alert "etcdGRPCRequestsSlow" fires in CI (Closed)

links to

openshift/cluster-etcd-operator#932: OCPBUGS-1130: increase etcdGRPCRequestsSlow thresholds

openshift/release#33248: OCPBUGS-1130: move gcp CP disks to 200gigs ssd

openshift/runbooks#69: OCPBUGS-1130: increase etcdGRPCRequestsSlow thresholds

(2 links to)

Assignee: Devan Goodwin

Reporter: Devan Goodwin

QA Contact: Devan Goodwin

Need Info From: None

Votes: 0

Watchers: 14

Created: 2022/07/21 1:51 PM

Updated: 2025/10/29 1:10 PM

Resolved: 2024/11/22 2:55 PM
