-
Bug
-
Resolution: Unresolved
-
Undefined
-
4.21.0
We have been tracking disruption in 4.21-e2e-agent-ha-dualstack-conformance jobs and have traced it back to MON-4290: add test for must-gather gather_metrics.
Most concerning are the disruption - monitoring failures - node reboots - high cpu alerts noted in the intervals
Observing the cpu usage during the test show high usage as well as outages.
topk(25, sum by (namespace) ( rate(container_cpu_usage_seconds_total{container!="",pod!="",namespace=~".*must-gather.*|.*monitoring.*"}[5m]) ) )
This test also regularly flakes
We want to revert the test while it is reworked to evaluate the impact on other jobs / tests and either address the CPU issues or limit the impact to not cause disruption / test failures. It doesn't appear to be a clean revert so either skipping the test entirely or just for the most impacted metal jobs is an alternative but we would like something done to address the disruption / failures quickly while longer term fixes are considered.