OpenShift Bugs / OCPBUGS-31492

Subset of Metal jobs have insufficient etcd disk I/O


Details


    Description

      Component Readiness has found a potential regression in [sig-arch][Early] CRDs for openshift.io should have subresource.status [Suite:openshift/conformance/parallel].

      Probability of significant regression: 98.48%

      Sample (being evaluated) Release: 4.16
      Start Time: 2024-03-21T00:00:00Z
      End Time: 2024-03-27T23:59:59Z
      Success Rate: 89.29%
      Successes: 25
      Failures: 3
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-02-01T00:00:00Z
      End Time: 2024-02-28T23:59:59Z
      Success Rate: 99.28%
      Successes: 138
      Failures: 1
      Flakes: 0

      View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20no-upgrade%20amd64%20metal-ipi%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-03-27%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-03-21%2000%3A00%3A00&testId=openshift-tests%3Ab3e170673c14c432c14836e9f41e7285&testName=%5Bsig-arch%5D%5BEarly%5D%20CRDs%20for%20openshift.io%20should%20have%20subresource.status%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial
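
      For reference, the "probability of significant regression" comes from statistically comparing the sample and base success/failure counts above. The following is a minimal sketch, in Go, of one such comparison (a one-sided Fisher's exact test over this report's counts); Component Readiness' actual computation in Sippy may differ in its details, so treat the printed confidence as illustrative rather than an exact reproduction of the 98.48% figure.

```go
package main

import (
    "fmt"
    "math"
)

// logChoose returns log(n choose k) via log-gamma to avoid overflow.
func logChoose(n, k int) float64 {
    a, _ := math.Lgamma(float64(n + 1))
    b, _ := math.Lgamma(float64(k + 1))
    c, _ := math.Lgamma(float64(n - k + 1))
    return a - b - c
}

// hypergeomPMF is P(X = k) when drawing n runs from a population of N runs
// that contains K failures in total.
func hypergeomPMF(k, K, n, N int) float64 {
    return math.Exp(logChoose(K, k) + logChoose(N-K, n-k) - logChoose(N, n))
}

func main() {
    // 2x2 table from the report: sample = 4.16, base = 4.15.
    sampleSuccess, sampleFail := 25, 3
    baseSuccess, baseFail := 138, 1

    n := sampleSuccess + sampleFail // sample total
    K := sampleFail + baseFail      // total failures across both releases
    N := n + baseSuccess + baseFail // grand total of runs

    // One-sided Fisher's exact test: probability of seeing at least this
    // many failures in the sample if both releases shared one failure rate.
    p := 0.0
    for k := sampleFail; k <= K && k <= n; k++ {
        p += hypergeomPMF(k, K, n, N)
    }

    fmt.Printf("sample pass rate: %.2f%%\n", 100*float64(sampleSuccess)/float64(n))
    fmt.Printf("base pass rate:   %.2f%%\n", 100*float64(baseSuccess)/float64(baseSuccess+baseFail))
    fmt.Printf("one-sided p-value: %.4f (regression confidence ~%.2f%%)\n", p, 100*(1-p))
}
```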

      In examining these test failures we found what is actually a fairly random grouping of failing tests; as a group, they are likely a significant part of why Component Readiness is reporting so much red on metal right now.

      In March the metal team modified some configuration such that a portion of metal jobs can now land in a couple of new environments, one of them ibmcloud.

      The test linked above helped surface the pattern: opening the spyglass chart in Prow shows a clear signature that we then found in many other failed metal jobs:

      • pod-logs section full of etcd logging problems where reads and writes took too long
      • a vertical line of disruption across multiple backends
      • an abnormal vertical line of etcd leader elections jumping around
      • a vertical line of failed e2e tests

      All of these line up within the same vertical window, indicating the problems occurred at the same time, and the pod-logs section is saturated with these etcd warnings.
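
      For anyone trying to confirm the first bullet without clicking through spyglass, a rough Go sketch of scanning downloaded pod logs for the etcd slow read/write warnings follows. The artifact path and the marker substrings here are illustrative assumptions, not the exact layout or log format of the CI artifacts.

```go
package main

import (
    "bufio"
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    // Root of extracted CI artifacts; adjust to wherever the job's pod-logs
    // were downloaded (hypothetical path).
    root := "./artifacts/pod-logs"

    // Substrings seen in etcd warnings about slow reads/writes; treat these
    // as examples, not an exhaustive or authoritative list.
    markers := []string{"took too long", "slow fdatasync"}

    counts := map[string]int{}
    err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() || !strings.Contains(path, "etcd") {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        scanner := bufio.NewScanner(f)
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // etcd JSON log lines can be long
        for scanner.Scan() {
            line := scanner.Text()
            for _, m := range markers {
                if strings.Contains(line, m) {
                    counts[m]++
                }
            }
        }
        return scanner.Err()
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "scan failed:", err)
        os.Exit(1)
    }
    for m, n := range counts {
        fmt.Printf("%-20s %d\n", m, n)
    }
}
```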

      dhiggins@redhat.com has pulled ibmcloud out of rotation until they can try provisioning SSDs for etcd.

      This bug tracks the introduction of a test that will surface this symptom of severely unhealthy etcd as a test failure, both to communicate this critical failure to engineers looking at the runs, and to help us locate affected runs, because no single existing test can really do this today.

      Azure and GCP jobs normally log these etcd warnings 3-5k times in a CI run; these ibmcloud runs were showing 30-70k. A limit of 10k was chosen based on examining the data in BigQuery: only 50 jobs have exceeded that threshold this month, all of them metal and agent jobs.
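
      A minimal sketch of the kind of check this bug proposes, assuming the run's etcd log lines have already been gathered; the package, function name, and marker strings are hypothetical, but the 10k threshold matches the figure above.

```go
package monitor

import (
    "fmt"
    "strings"
)

// Substrings treated as etcd "overloaded" warnings; illustrative, not the
// authoritative set the real test will use.
var etcdOverloadMarkers = []string{"took too long", "slow fdatasync"}

// maxEtcdOverloadWarnings is the failure threshold suggested by the BigQuery
// data: healthy Azure/GCP runs log roughly 3-5k of these warnings, while the
// bad ibmcloud metal runs logged 30-70k.
const maxEtcdOverloadWarnings = 10000

// CheckEtcdOverload counts slow read/write warnings in the collected etcd
// log lines and returns an error (i.e. a test failure) when the count
// exceeds the threshold, making "etcd is very unhealthy" visible as a
// single, explicit signal in the run.
func CheckEtcdOverload(etcdLogLines []string) error {
    count := 0
    for _, line := range etcdLogLines {
        for _, m := range etcdOverloadMarkers {
            if strings.Contains(line, m) {
                count++
                break
            }
        }
    }
    if count > maxEtcdOverloadWarnings {
        return fmt.Errorf("etcd logged %d slow read/write warnings (threshold %d): disk I/O is likely insufficient", count, maxEtcdOverloadWarnings)
    }
    return nil
}
```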

          People

            Assignee: rhn-engineering-dgoodwin Devan Goodwin
            Reporter: rhn-engineering-dgoodwin Devan Goodwin