OpenShift Bugs / OCPBUGS-19388

Observed increase in CPU usage during E2E testing with cgroupsv2


      GCP is the only platform that consistently fires HighOverallControlPlaneCPU and ExtremelyHighIndividualControlPlaneCPU alerts during its CI runs.

      The problem appears mainly on GCP OVN micro upgrades, because that is where we have the most data. SDN runs appear to have similar problems with less alert firing time, as do minor upgrades.

      Focusing on OVN micro upgrades, the P50 across all CI clusters is around 3m of HighOverall and 8m of ExtremelyHighIndividual.

      We began looking at this while chasing runs with serious disruption, which led us to log statements that seemed to indicate CPU starvation.

      GCP CI uses e2-standard-4 instances, which have 4 vCPUs and 16 GB of RAM. This roughly matches AWS, but Azure uses 8 vCPUs and 32 GB of RAM.

      We did some investigation in the bug that started this (OCPBUGS-18544) and found that 6 vCPUs drop the firing time for these alerts to almost 0, although the alerts still enter a pending state for reasonably long periods. With 8 vCPUs, the pending states are effectively gone. More vCPUs help, but for cost reasons we were asked to loop in the apiserver team to help determine whether something is wrong here.

      It appears the problem may have gotten worse in 4.14: the P50 for ExtremelyHighIndividual has more than doubled compared to 4.13, from 200s to over 500s. (DISCLAIMER: we cannot be 100% sure this comparison is accurate, because our new numbers take into account whether or not the master nodes were updated during a micro upgrade. However, given that this is exposed by e2e after the upgrade, I think the conclusion is valid: this does seem worse in 4.14.)
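
      To make the 4.13 versus 4.14 comparison concrete, here is a minimal Python sketch of how a per-release P50 of firing seconds could be computed. The per-run values are invented for illustration, chosen only so that the medians land near the 200s and 500s figures quoted above. They are not taken from the CI database, and the function name is hypothetical.

```python
from statistics import median

# Hypothetical per-run firing seconds for ExtremelyHighIndividualControlPlaneCPU.
# These values are made up for illustration; they are NOT real CI data.
firing_seconds = {
    "4.13": [0, 120, 180, 200, 240, 420, 900],
    "4.14": [60, 300, 480, 510, 600, 840, 1500],
}


def p50_by_release(samples):
    """Return the median (P50) firing time in seconds for each release."""
    return {release: median(values) for release, values in samples.items()}


p50 = p50_by_release(firing_seconds)
ratio = p50["4.14"] / p50["4.13"]
print(p50, f"ratio={ratio:.2f}")
```

      With these made-up inputs the 4.14 median is more than twice the 4.13 median, which is the shape of the regression described above.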

      To find job runs experiencing high alert time, use the above dashboard links, scroll down to Most Recent Job Runs, and click on any that have a high alert seconds value. This number represents the firing time; we do not track pending time in this database.
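
      Because the database records only firing time, pending time would have to be derived from the Prometheus data for a run. The following is a minimal sketch, assuming samples of Prometheus's built-in ALERTS series (which carries an alertstate label of "pending" or "firing") have already been exported as (timestamp, alertstate) pairs at a fixed step. The sample data, the 30s step, and the function name are all assumptions.

```python
from collections import Counter


def seconds_by_state(samples, step_seconds=30):
    """Approximate the time an alert spent in each state.

    Each sample is a (unix_timestamp, alertstate) pair. Every sample is
    counted as covering one step, so the result is an estimate only.
    """
    totals = Counter()
    for _timestamp, state in samples:
        totals[state] += step_seconds
    return dict(totals)


# Hypothetical samples: four pending evaluations followed by two firing ones.
samples = [(1000 + 30 * i, "pending") for i in range(4)]
samples += [(1120 + 30 * i, "firing") for i in range(2)]

result = seconds_by_state(samples)
print(result)
```

      Counting one step per sample is a deliberate simplification; it ignores gaps in the series, which matters if the Prometheus data for a run is incomplete.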

      The PromeCIeus link under debug tools should be helpful in examining Prometheus metrics during these runs. However, we have seen some corrupted files, which may be related to the extremely bad runs.

      If you need metrics, try one of the test runs with 6 or 8 vCPUs listed below. These seem to have produced valid Prometheus data, so it appears that high CPU is linked to the corrupted metrics data:

      8 vcpu with valid prom data but no high cpu alerts: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/43300/rehearse-43300-periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1702716280932405248

      6 vcpu with valid prom data, high cpu alerts went pending but never firing: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/43416/rehearse-43416-periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade/1702718765097029632

      Additionally, Justin Pierce may be able to help examine GCP cloud metrics for the VMs if needed.

        1. 06/15 growth.png
          97 kB
          Damien Grisonnet
        2. 07/26 growth.png
          93 kB
          Damien Grisonnet
        3. 08/24.png
          95 kB
          Damien Grisonnet
        4. 2023-09-21-182034_1671x305_scrot.png
          47 kB
          Damien Grisonnet
        5. 2023-09-21-182048_1655x295_scrot.png
          45 kB
          Damien Grisonnet
        6. 2023-10-02-173152_1862x864_scrot.png
          75 kB
          Damien Grisonnet
        7. 2023-10-02-173244_1857x856_scrot.png
          110 kB
          Damien Grisonnet
        8. 2023-10-02-173438_1857x857_scrot.png
          74 kB
          Damien Grisonnet
        9. 2023-10-02-173639_1852x850_scrot.png
          71 kB
          Damien Grisonnet
        10. 2023-10-02-173639_1852x850_scrot-1.png
          71 kB
          Damien Grisonnet
        11. 2023-10-03-161106_2496x1177_scrot.png
          99 kB
          Damien Grisonnet
        12. 2023-10-03-161602_2510x1155_scrot.png
          106 kB
          Damien Grisonnet
        13. 2023-10-05-160221_2509x1252_scrot.png
          148 kB
          Damien Grisonnet
        14. 2023-10-05-160221_2509x1252_scrot-1.png
          148 kB
          Damien Grisonnet
        15. 2023-10-05-160339_2512x1260_scrot.png
          149 kB
          Damien Grisonnet
        16. 2023-10-05-160659_2507x1261_scrot.png
          240 kB
          Damien Grisonnet
        17. 2023-10-05-161653_2508x1249_scrot.png
          245 kB
          Damien Grisonnet
        18. 2023-10-09-loki.png
          393 kB
          Ryan Phillips
        19. Average cgroupsv1 during test period.png
          49 kB
          Justin Pierce
        20. Average cgroupsv2 during test period.png
          43 kB
          Justin Pierce
        21. cgroupsv1 worker mean.png
          53 kB
          Justin Pierce
        22. cgroupsv2 worker mean.png
          51 kB
          Justin Pierce
        23. full-view-4.14.png
          143 kB
          Damien Grisonnet
        24. GCP_CONSOLE_THROTTLE.png
          305 kB
          Ryan Phillips
        25. image-2023-10-17-12-05-41-999.png
          54 kB
          Justin Pierce
        26. image-2023-10-17-12-05-51-312.png
          54 kB
          Justin Pierce
        27. image-2023-10-17-12-06-19-797.png
          54 kB
          Justin Pierce
        28. image-2023-10-25-10-10-21-572.png
          199 kB
          Harshal Patil
        29. image-2023-10-25-10-10-59-627.png
          191 kB
          Harshal Patil
        30. image-2023-10-25-13-24-07-747.png
          104 kB
          Justin Pierce
        31. image-2023-10-25-13-24-49-186.png
          0.3 kB
          Justin Pierce
        32. image-2023-10-25-13-25-05-346.png
          102 kB
          Justin Pierce
        33. image-2023-10-25-13-25-16-900.png
          114 kB
          Justin Pierce
        34. image-2023-10-25-13-36-50-277.png
          64 kB
          Justin Pierce
        35. image-2023-10-25-13-37-07-748.png
          55 kB
          Justin Pierce
        36. PromeCleus-Requests.png
          405 kB
          Ryan Phillips
        37. request/s per resources on 4.13.png
          53 kB
          Damien Grisonnet
        38. requests/s per resources on 4.14.png
          52 kB
          Damien Grisonnet
        39. screenshot-1.png
          109 kB
          Ryan Phillips

              harpatil@redhat.com Harshal Patil
              rhn-engineering-dgoodwin Devan Goodwin
              Rahul Gangwar Rahul Gangwar