Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False

This is related to a component regression we found that, as far as we could tell, no existing test clearly catches. Sample runs:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-kube-apiserver-rollout/1827763939853733888

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-ipv4/1826908352773361664

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-dualstack/1828844069434953728

All three runs show a pattern. The individual test failures look unpredictable: some tests pass while, at the same time, others fail to talk to the apiserver.

The pattern we see is one or more tests failing right at the start of e2e testing, accompanied by disruption, etcd log messages indicating slowness, and etcd leadership state changes.

Because the failing tests are unpredictable, we'd like a test that catches this symptom directly. We think the safest way to do this is to look for disruption within X minutes of the first e2e test.

This would be implemented as a monitortest, likely somewhere around here: https://github.com/openshift/origin/blob/master/pkg/monitortests/kubeapiserver/legacykubeapiservermonitortests/monitortest.go

That said, it would also be reasonable to add a new monitortest in the parent package above this level.

The test would need to do the following:

Scan the final intervals for the earliest interval with source=SourceE2ETest (a constant in monitorapi/types.go) and save its start time.
Scan the final intervals for those with source=SourceDisruption, reason=DisruptionBegan, and a backend matching one of the apiservers (kube, openshift, oauth).
Flake the test (return a failure junit result plus a success junit result) if any of those SourceDisruption intervals fall within X minutes of that first e2e test.
Choose X based on what we see in the runs linked above.
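
The steps above could be sketched roughly as follows. This is only an illustration, not the code in openshift/origin: the Interval and JUnitResult types are simplified, hypothetical stand-ins for the real monitorapi and junit types, the constant string values and the backend name prefixes are assumptions, the sample data is made up, and the 5 minute window is a placeholder for X. It also assumes "within X minutes" means at or after the first e2e test start, up to X minutes later.

```go
package main

import (
	"strings"
	"time"
)

// Hypothetical stand-ins for constants in monitorapi/types.go; the string values are assumed.
const (
	SourceE2ETest    = "E2ETest"
	SourceDisruption = "Disruption"
	DisruptionBegan  = "DisruptionBegan"
)

// exampleWindow is a placeholder for X; the real value should come from the sample runs.
const exampleWindow = 5 * time.Minute

// Interval is a simplified, hypothetical stand-in for a monitor interval.
type Interval struct {
	Source, Reason, Backend string
	From                    time.Time
}

// JUnitResult is a simplified, hypothetical stand-in for a junit test case result.
type JUnitResult struct {
	Name   string
	Failed bool
}

// firstE2EStart returns the earliest start time among e2e test intervals.
func firstE2EStart(intervals []Interval) (time.Time, bool) {
	var first time.Time
	found := false
	for _, iv := range intervals {
		if iv.Source != SourceE2ETest {
			continue
		}
		if !found || iv.From.Before(first) {
			first, found = iv.From, true
		}
	}
	return first, found
}

// isAPIServerBackend matches kube, openshift, and oauth apiserver backends by an assumed name prefix.
func isAPIServerBackend(backend string) bool {
	for _, prefix := range []string{"kube-api", "openshift-api", "oauth-api"} {
		if strings.HasPrefix(backend, prefix) {
			return true
		}
	}
	return false
}

// earlyAPIDisruptions returns apiserver DisruptionBegan intervals that start
// within window of the first e2e test start.
func earlyAPIDisruptions(intervals []Interval, window time.Duration) []Interval {
	start, ok := firstE2EStart(intervals)
	if !ok {
		return nil // no e2e tests ran, so there is nothing to compare against
	}
	deadline := start.Add(window)
	var early []Interval
	for _, iv := range intervals {
		if iv.Source != SourceDisruption || iv.Reason != DisruptionBegan {
			continue
		}
		if !isAPIServerBackend(iv.Backend) {
			continue
		}
		if iv.From.Before(start) || iv.From.After(deadline) {
			continue
		}
		early = append(early, iv)
	}
	return early
}

// junitResults reports a flake (a failure plus a success under the same name)
// when early disruption was seen, and a single success otherwise.
func junitResults(name string, early []Interval) []JUnitResult {
	if len(early) == 0 {
		return []JUnitResult{{Name: name}}
	}
	return []JUnitResult{{Name: name, Failed: true}, {Name: name}}
}

// sampleIntervals returns made-up data for illustration only.
func sampleIntervals() []Interval {
	base := time.Date(2024, 8, 30, 12, 0, 0, 0, time.UTC)
	return []Interval{
		{Source: SourceE2ETest, From: base.Add(10 * time.Minute)},
		{Source: SourceE2ETest, From: base},
		{Source: SourceDisruption, Reason: DisruptionBegan, Backend: "kube-api-new-connections", From: base.Add(2 * time.Minute)},
		{Source: SourceDisruption, Reason: DisruptionBegan, Backend: "oauth-api-reused-connections", From: base.Add(30 * time.Minute)},
		{Source: SourceDisruption, Reason: DisruptionBegan, Backend: "ingress-to-console", From: base.Add(time.Minute)},
		{Source: SourceDisruption, Reason: "DisruptionEnded", Backend: "kube-api-new-connections", From: base.Add(3 * time.Minute)},
	}
}
```

Calling junitResults with the output of earlyAPIDisruptions(sampleIntervals(), exampleWindow) yields a failure and a success under the same test name, which is how the flake described above is represented here. A real monitortest would read the backend and reason from the actual interval structure rather than from flat string fields.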

links to

openshift/origin#29061: TRT-1794: monitor for api disruption in early E2E tests

Assignee: Luke Meyer

Reporter: Devan Goodwin

Votes: 0

Watchers: 1

Created: 2024/08/30 2:36 PM

Updated: 2024/09/06 1:07 PM

Resolved: 2024/09/06 1:07 PM
