- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.18.0
- None
- Moderate
- None
- False
TRT disruption monitoring picked up a severe change in the disruption P95 on Azure, which turned out to originate entirely from one job: periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-csi
The graph indicates the problem started on Aug 24th.
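For context on the metric being tracked, here is a minimal sketch of how a P95 disruption figure can be derived from per-run outage durations. This is purely illustrative: the function name, the nearest-rank method, and the sample durations are assumptions, not the actual TRT implementation.

```python
import math

def p95(durations):
    """Return the 95th-percentile outage duration (seconds)
    using the nearest-rank method: the smallest value such that
    at least 95% of samples are less than or equal to it."""
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-run outage durations in seconds (made-up sample data).
outages = [0, 0, 3, 12, 45, 110, 180, 260, 400]
print(p95(outages))  # the 9th of 9 sorted values: 400
```

A single job producing long outages drags this percentile up sharply, which is why one misbehaving job can dominate the fleet-wide P95 signal.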
The disruption appears linked to a very long-running test:
External Storage [Driver: disk.csi.azure.com] [Testpattern: Dynamic PV (filesystem volmode)] OpenShift CSI extended - SCSI LUN Overflow should use many PVs on a single node [Serial][Timeout:60m]
This test can run for up to 45 minutes in some cases, and sometimes causes a loss of internal networking to one host.
Sample job runs can be found by going to the dashboard link at the start of the description, scrolling down to the job runs, and looking for those with high numbers.
Expand the intervals chart to see the disruption on any run.
Outages range from 100 to 400 seconds, which is quite severe. A node is going NotReady, and it appears to be the node that all of the failing disruption backends are hitting.
Is this expected for this test? It seems like it might be indicating a real problem.