Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.15.0
Affects Version/s: 4.15
Component/s: Etcd
Labels:
- trt-standup

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
5
Severity:
Critical
Regression:
Yes

Target Backport Versions:
None
Target Version:

4.15.0
Release Blocker:
Approved
Sprint:
ETCD Sprint 245, ETCD Sprint 246, ETCD Sprint 247
sprint_count:
3

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
Release Note Not Required
Release Note Text:
N/A

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Based on Component Readiness data comparing success rates for two particular tests, we are regressing by ~7-10% between 4.14.z and the current 4.15 master (in other words, we made the product up to ~10% worse).

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1720630313664647168

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1719915053026643968

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721475601161785344

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1724202075631390720

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721927613917696000

The failures in these jobs are all caused by increased etcd leader elections disrupting seemingly unrelated test cases on the vSphere AMD64 platform.

Since this particular platform's business significance is high, I'm setting this as "Critical" severity.

Please get in touch with me or dwest@redhat.com if more teams need to be pulled into investigation and mitigation.

Version-Release number of selected component (if applicable):

4.15 / master

How reproducible:

Component Readiness Board

Actual results:

etcd leader elections are elevated. Some jobs indicate the cause is either insufficient disk I/O throughput or network overload.

Expected results:

1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document it or work with the vSphere infrastructure provider to fix it.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.
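
As a rough illustration of item 4, one possible starting point is the Prometheus counter etcd_server_leader_changes_seen_total that etcd exposes. The sketch below assumes periodic samples of that counter have already been collected for one etcd member. The helper names and the sample values are hypothetical and are not taken from this bug's jobs.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase of a monotonically increasing counter across samples.

    A drop between consecutive samples is treated as a counter reset
    (for example an etcd member restart); the new value is then counted
    as the increase since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


def leader_changes_per_hour(samples: Sequence[float], interval_seconds: float) -> float:
    """Average leader changes per hour over the sampled window."""
    if len(samples) < 2:
        return 0.0
    window_hours = (len(samples) - 1) * interval_seconds / 3600.0
    return counter_increase(samples) / window_hours


if __name__ == "__main__":
    # Hypothetical scrape values, one every 15 minutes (assumption, not real data).
    scraped = [3, 3, 4, 6, 1, 2]
    print(counter_increase(scraped))              # 5.0
    print(leader_changes_per_hour(scraped, 900))  # 4.0
```

Comparing this rate between 4.14.z and 4.15 clusters on the same platform would give a first estimate of how often elevated leader elections occur, although it says nothing about their cause.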

Additional info:

- grafana-4.15.png (386 kB, 2023/11/28 6:24 PM)
- grafana-4.14.png (345 kB, 2023/11/28 6:24 PM)

blocks

OCPBUGS-27151 [regression] increased etcd leader elections significantly impacting vsphere amd64 platform

Closed

depends on

OCPBUGS-27094 [regression] increased etcd leader elections significantly impacting vsphere amd64 platform

Closed

is cloned by

OCPBUGS-27151 [regression] increased etcd leader elections significantly impacting vsphere amd64 platform

Closed

relates to

TRT-1436 Investigate and tune disruption alerts

Closed

TRT-1353 Investigate Sep 22 Azure OpenShift API Disruption Regression

Closed

TRT-1370 Break etcd leadership intervals out of pod logs section in chart

Closed

links to

https://github.com/openshift/oauth-apiserver/pull/96

https://github.com/openshift/openshift-apiserver/pull/413

RHSA-2023:7198 OpenShift Container Platform 4.15 security update


Assignee: Lukasz Szaszkiewicz

Reporter: Michal Fojtik (Inactive)

QA Contact: Ke Wang

Need Info From: None

Votes: 0

Watchers: 14

Created: 2023/11/14 3:04 PM

Updated: 2025/07/24 11:40 PM

Resolved: 2024/02/27 9:06 PM
