- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.18.0
- None
- Moderate
- None
- False
TRT disruption monitoring picked up a severe change in the disruption P95 on Azure, which turned out to originate entirely from one job: periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-csi
The graph indicates the problem started on Aug 24th.
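For context on the metric being tracked, here is a minimal sketch of how a P95 disruption figure can be derived from per-run outage durations. This is purely illustrative: the function name, the nearest-rank method, and the sample durations are assumptions, not the actual TRT implementation.

```python
import math

def p95(durations):
    """Return the 95th-percentile outage duration (seconds)
    using the nearest-rank method: the smallest value such that
    at least 95% of samples are less than or equal to it."""
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-run outage durations in seconds (made-up sample data).
outages = [0, 0, 3, 12, 45, 110, 180, 260, 400]
print(p95(outages))  # the 9th of 9 sorted values: 400
```

A single job producing long outages drags this percentile up sharply, which is why one misbehaving job can dominate the fleet-wide P95 signal.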
The disruption appears linked to a very long-running test:
External Storage [Driver: disk.csi.azure.com] [Testpattern: Dynamic PV (filesystem volmode)] OpenShift CSI extended - SCSI LUN Overflow should use many PVs on a single node [Serial][Timeout:60m]
This test can run for up to 45 minutes in some cases, and sometimes causes a loss of internal networking to one host.
Sample job runs can be found by going to the dashboard link at the start of the description, scrolling down to the job runs, and looking for those with high numbers.
Expand the intervals chart to see the disruption on any run.
Outages range from 100 to 400 seconds, which is quite severe. A node is going NotReady, and it appears to be the node that all of the failing disruption backends are hitting.
Is this expected for this test? It seems like it might be indicating a real problem.