  1. OpenShift Bugs
  2. OCPBUGS-63149

must-gather gather_metrics high cpu / disruption failures


    • Bug
    • Resolution: Unresolved
    • Undefined
    • 4.21.0
    • 4.21.0
    • Monitoring

      We have been tracking disruption in 4.21-e2e-agent-ha-dualstack-conformance jobs and have traced it back to MON-4290: add test for must-gather gather_metrics.

      Most concerning are the disruption, monitoring failures, node reboots, and high CPU alerts noted in the intervals.

      Observing CPU usage during the test shows high usage as well as outages:

      topk(25,
        sum by (namespace) (
          rate(container_cpu_usage_seconds_total{container!="",pod!="",namespace=~".*must-gather.*|.*monitoring.*"}[5m])
        )
      )
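
      To reproduce the graph outside a dashboard, the query above can be sent to the standard Prometheus HTTP API (`/api/v1/query_range`). The following is a minimal sketch, not the tooling used in this investigation: the helper names, the base URL, the default step, and the namespace names in the sample are assumptions for illustration, and authentication is left out.

```python
from urllib.parse import urlencode

# PromQL from this report: per-namespace CPU rate for must-gather and monitoring namespaces.
QUERY = (
    'topk(25, sum by (namespace) ('
    'rate(container_cpu_usage_seconds_total{container!="",pod!="",'
    'namespace=~".*must-gather.*|.*monitoring.*"}[5m])))'
)


def build_query_range_url(base_url: str, start: float, end: float, step: str = "30s") -> str:
    """Build a Prometheus /api/v1/query_range URL for QUERY (hypothetical helper)."""
    params = urlencode({"query": QUERY, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def busiest_namespaces(response: dict) -> list[tuple[str, float]]:
    """Return (namespace, peak CPU cores) pairs from a query_range response, highest first."""
    peaks = []
    for series in response["data"]["result"]:
        namespace = series["metric"].get("namespace", "<none>")
        # Each sample is [unix_timestamp, "value-as-string"].
        peak = max(float(value) for _, value in series["values"])
        peaks.append((namespace, peak))
    return sorted(peaks, key=lambda item: item[1], reverse=True)
```

      Fetching the URL with any HTTP client and passing the decoded JSON to `busiest_namespaces` gives a quick ranking of which namespaces peaked during the test window.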
      

      This test also flakes regularly.

      We want to revert the test while it is reworked, so that its impact on other jobs and tests can be evaluated and the CPU issues can either be addressed or limited so that they do not cause disruption or test failures. It does not appear to be a clean revert, so an alternative is to skip the test, either entirely or only for the most impacted metal jobs. We would like something done quickly to address the disruption and failures while longer-term fixes are considered.

              rh-ee-amrini Ayoub Mrini
              rh-ee-fbabcock Forrest Babcock
              Junqi Zhao