-
Story
-
Resolution: Done
-
Major
In TRT-1514 we researched disruption to our liveness endpoints (outside of the clusters under test), grouped by build cluster.
A spreadsheet was created to track results from versions of queries like:
SELECT
  JobRunDate,
  Cluster,
  ClusterType,
  AverageDisruption,
  MaxDisruption,
  TotalDisruption
FROM (
  SELECT
    AVG(DisruptionSeconds) OVER (PARTITION BY EXTRACT(DATE FROM JobRunStartTime), Cluster) AS AverageDisruption,
    MAX(DisruptionSeconds) OVER (PARTITION BY EXTRACT(DATE FROM JobRunStartTime), Cluster) AS MaxDisruption,
    SUM(DisruptionSeconds) OVER (PARTITION BY EXTRACT(DATE FROM JobRunStartTime), Cluster) AS TotalDisruption,
    Cluster,
    EXTRACT(DATE FROM JobRunStartTime) AS JobRunDate,
    CASE
      WHEN Cluster = 'build01' THEN 'AWS'
      WHEN Cluster = 'build02' THEN 'GCP'
      WHEN Cluster = 'build03' THEN 'AWS'
      WHEN Cluster = 'build04' THEN 'GCP'
      WHEN Cluster = 'build05' THEN 'AWS'
      WHEN Cluster = 'vsphere02' THEN 'VSPHERE'
      ELSE 'UNKNOWN'
    END AS ClusterType
  FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption`
  WHERE BackendName = "ci-cluster-network-liveness-new-connections"
    AND JobRunStartTime > TIMESTAMP("2023-08-01 00:00:01+00")
    AND JobRunStartTime < TIMESTAMP("2024-02-20 00:00:01+00")
)
GROUP BY Cluster, JobRunDate, AverageDisruption, MaxDisruption, TotalDisruption, ClusterType
ORDER BY JobRunDate DESC, TotalDisruption DESC
We want to formalize these queries as a view covering the current and incoming liveness probes, so that we can analyze how each build cluster is impacted and dig deeper into trends. Once the view is created, we will likely want to set up a new Grafana dashboard to help investigate.
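A rough sketch of what such a view could look like, for discussion only. The view name (BuildClusterLivenessDisruptionDaily) and the BackendName prefix pattern are assumptions, not agreed names. Incoming probes may not share the prefix, in which case an explicit IN list would be needed. The sketch uses plain GROUP BY aggregates in place of the window functions plus de-duplication, which should yield the same per-day, per-cluster values. It adds BackendName as a dimension so multiple probes stay separate, and it leaves the date range to the caller.

```sql
-- Sketch only: the view name and the LIKE pattern below are assumptions.
CREATE OR REPLACE VIEW `openshift-ci-data-analysis.ci_data.BuildClusterLivenessDisruptionDaily` AS
SELECT
  EXTRACT(DATE FROM JobRunStartTime) AS JobRunDate,
  BackendName,
  Cluster,
  -- Same cluster-to-platform mapping as the ad hoc query.
  CASE
    WHEN Cluster = 'build01' THEN 'AWS'
    WHEN Cluster = 'build02' THEN 'GCP'
    WHEN Cluster = 'build03' THEN 'AWS'
    WHEN Cluster = 'build04' THEN 'GCP'
    WHEN Cluster = 'build05' THEN 'AWS'
    WHEN Cluster = 'vsphere02' THEN 'VSPHERE'
    ELSE 'UNKNOWN'
  END AS ClusterType,
  AVG(DisruptionSeconds) AS AverageDisruption,
  MAX(DisruptionSeconds) AS MaxDisruption,
  SUM(DisruptionSeconds) AS TotalDisruption
FROM `openshift-ci-data-analysis.ci_data.UnifiedBackendDisruption`
-- Assumed prefix; today this matches ci-cluster-network-liveness-new-connections.
WHERE BackendName LIKE 'ci-cluster-network-liveness-%'
-- Ordinals refer to JobRunDate, BackendName, Cluster, ClusterType.
GROUP BY 1, 2, 3, 4;

-- Example use: reproduce the original date window for one probe.
SELECT *
FROM `openshift-ci-data-analysis.ci_data.BuildClusterLivenessDisruptionDaily`
WHERE BackendName = 'ci-cluster-network-liveness-new-connections'
  AND JobRunDate BETWEEN DATE '2023-08-01' AND DATE '2024-02-19'
ORDER BY JobRunDate DESC, TotalDisruption DESC;
```

Keeping the date filter out of the view lets a dashboard supply its own time range. The trade-off is that every query against the view should filter on JobRunDate to avoid scanning the full table.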
- relates to: TRT-1514 Mass 35s network outage (Closed)