OpenShift Bugs / OCPBUGS-19388

Observed increase in CPU usage during E2E testing with cgroupsv2


      GCP is the only platform that consistently fires the HighOverallControlPlaneCPU and ExtremelyHighIndividualControlPlaneCPU alerts during its CI runs.
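
      For reference, both alerts key off node_cpu_seconds_total on the master nodes. Below is a minimal Python sketch of the per-master quantity ExtremelyHighIndividualControlPlaneCPU watches, run against a Prometheus endpoint restored from one of these jobs; the endpoint URL is a placeholder and the PromQL only approximates the shipped rule, so check the live alert definition for the exact expression.

      import requests

      PROM = "http://localhost:9090"  # placeholder: a Prometheus restored from the run

      # Per-master CPU utilization (%). ExtremelyHighIndividualControlPlaneCPU
      # fires when a single master stays above roughly 90% busy.
      QUERY = (
          '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)'
          ' and on (instance) label_replace(kube_node_role{role="master"},'
          ' "instance", "$1", "node", "(.+)")'
      )

      resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=30)
      resp.raise_for_status()
      for sample in resp.json()["data"]["result"]:
          print(f'{sample["metric"].get("instance", "?")}: {float(sample["value"][1]):.1f}% busy')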

      The problem appears mainly on GCP OVN micro upgrades, because that is where we have the most data, but SDN runs show similar problems with less alert firing time, as do minor upgrades.

      Focusing on OVN micro upgrades, the P50 across all CI clusters is around 3m of HighOverall firing and 8m of ExtremelyHighIndividual.

      We began looking at this while chasing runs with serious disruption, which led to log statements that seemed to indicate CPU starvation.

      GCP CI uses e2-standard-4 instances, which have 4 vCPUs and 16 GB of RAM. This roughly matches AWS, but Azure uses 8 vCPUs and 32 GB of RAM.

      We did some investigation in the bug that started this (OCPBUGS-18544) and found that 6 vCPUs drops the firing time for these alerts to almost 0, though they still enter a pending state for reasonably long periods. With 8 vCPUs the pending states are effectively gone as well. More vCPUs helps, but for cost reasons we were asked to loop in the apiserver team to see whether something is actually wrong here or not.

      It appears the problem may have gotten worse in 4.14: the P50 for ExtremelyHighIndividual has more than doubled compared to 4.13, from 200s to over 500s. (DISCLAIMER: we cannot be 100% sure this comparison is accurate, as our new numbers take into account whether or not the master nodes were updated during a micro upgrade; however, given this is exposed by e2e after the upgrade, the conclusion appears valid, and this does seem worse in 4.14.)

      To find job runs experiencing high alert time, use the above dashboard links, scroll down to Most Recent Jobs runs, and click on any that have high alert seconds. This number represents firing time only; we do not track pending in this database.
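
      Because the database only records firing time, pending time has to be recovered from a run's own metrics. One way to total both states is the built-in ALERTS series, as in this sketch (endpoint and time window are placeholders for a locally restored Prometheus and the job's actual run window):

      from collections import Counter

      import requests

      PROM = "http://localhost:9090"  # placeholder endpoint
      STEP = 30                       # seconds between range-query samples

      params = {
          "query": 'ALERTS{alertname="ExtremelyHighIndividualControlPlaneCPU"}',
          "start": "2023-09-15T12:00:00Z",  # placeholder: the job run's window
          "end": "2023-09-15T16:00:00Z",
          "step": str(STEP),
      }
      resp = requests.get(f"{PROM}/api/v1/query_range", params=params, timeout=60)
      resp.raise_for_status()

      # ALERTS carries an alertstate label ("pending" or "firing"); every sample
      # present in the range contributes one STEP's worth of time in that state.
      seconds = Counter()
      for series in resp.json()["data"]["result"]:
          seconds[series["metric"].get("alertstate", "unknown")] += len(series["values"]) * STEP

      for state, secs in sorted(seconds.items()):
          print(f"{state}: ~{secs}s")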

      The PromeCIeus link under debug tools should be helpful for examining prom metrics during these runs; however, we have seen some corrupted files, which may be related to the extremely bad runs.

      If you need metrics, perhaps pull them from one of the tests with 6 or 8 vCPUs below; these seem to have produced valid Prometheus data, which suggests the corrupted metrics data is linked to the high CPU itself (a query sketch follows the links):

      8 vcpu with valid prom data but no high cpu alerts: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/43300/rehearse-43300-periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1702716280932405248

      6 vcpu with valid prom data, high cpu alerts went pending but never firing: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/43416/rehearse-43416-periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade/1702718765097029632
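
      For either of those runs, a range query over the overall control-plane utilization (roughly the quantity HighOverallControlPlaneCPU tracks) gives a quick read on the CPU peak and on time spent above the ~60% threshold. A sketch along the same lines as above, again with placeholder endpoint and window and an approximated expression:

      import requests

      PROM = "http://localhost:9090"  # placeholder endpoint
      STEP = 30

      # Mean busy % across the masters; approximates the HighOverallControlPlaneCPU
      # expression, which fires when this stays above roughly 60%.
      QUERY = (
          'sum(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)'
          ' and on (instance) label_replace(kube_node_role{role="master"},'
          ' "instance", "$1", "node", "(.+)"))'
          ' / count(kube_node_role{role="master"})'
      )

      resp = requests.get(
          f"{PROM}/api/v1/query_range",
          params={"query": QUERY, "start": "2023-09-15T12:00:00Z",  # placeholder window
                  "end": "2023-09-15T16:00:00Z", "step": str(STEP)},
          timeout=60,
      )
      resp.raise_for_status()
      result = resp.json()["data"]["result"]
      if result:
          values = [float(v) for _, v in result[0]["values"]]
          print(f"peak overall control-plane CPU: {max(values):.1f}%")
          print(f"time above 60%: ~{sum(STEP for v in values if v > 60.0)}s")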

      Additionally, Justin Pierce may be able to help examine GCP cloud metrics for the VMs if needed.

      Attachments:

        1. screenshot-1.png (109 kB)
        2. requests/s per resources on 4.14.png (52 kB)
        3. request/s per resources on 4.13.png (53 kB)
        4. PromeCleus-Requests.png (405 kB)
        5. image-2023-10-25-13-37-07-748.png (55 kB)
        6. image-2023-10-25-13-36-50-277.png (64 kB)
        7. image-2023-10-25-13-25-16-900.png (114 kB)
        8. image-2023-10-25-13-25-05-346.png (102 kB)
        9. image-2023-10-25-13-24-49-186.png (0.3 kB)
        10. image-2023-10-25-13-24-07-747.png (104 kB)
        11. image-2023-10-25-10-10-59-627.png (191 kB)
        12. image-2023-10-25-10-10-21-572.png (199 kB)
        13. image-2023-10-17-12-06-19-797.png (54 kB)
        14. image-2023-10-17-12-05-51-312.png (54 kB)
        15. image-2023-10-17-12-05-41-999.png (54 kB)
        16. GCP_CONSOLE_THROTTLE.png (305 kB)
        17. full-view-4.14.png (143 kB)
        18. cgroupsv2 worker mean.png (51 kB)
        19. cgroupsv2_investigation.zip (63 kB)
        20. cgroupsv1 worker mean.png (53 kB)
        21. Average cgroupsv2 during test period.png (43 kB)
        22. Average cgroupsv1 during test period.png (49 kB)
        23. all_workers_usage_cgroups_compare.zip (30 kB)
        24. 2023-10-09-loki.png (393 kB)
        25. 2023-10-05-161653_2508x1249_scrot.png (245 kB)
        26. 2023-10-05-160659_2507x1261_scrot.png (240 kB)
        27. 2023-10-05-160339_2512x1260_scrot.png (149 kB)
        28. 2023-10-05-160221_2509x1252_scrot-1.png (148 kB)
        29. 2023-10-05-160221_2509x1252_scrot.png (148 kB)
        30. 2023-10-03-161602_2510x1155_scrot.png (106 kB)
        31. 2023-10-03-161106_2496x1177_scrot.png (99 kB)
        32. 2023-10-02-173639_1852x850_scrot-1.png (71 kB)
        33. 2023-10-02-173639_1852x850_scrot.png (71 kB)
        34. 2023-10-02-173438_1857x857_scrot.png (74 kB)
        35. 2023-10-02-173244_1857x856_scrot.png (110 kB)
        36. 2023-10-02-173152_1862x864_scrot.png (75 kB)
        37. 2023-09-21-182048_1655x295_scrot.png (45 kB)
        38. 2023-09-21-182034_1671x305_scrot.png (47 kB)
        39. 08/24.png (95 kB)
        40. 07/26 growth.png (93 kB)
        41. 06/15 growth.png (97 kB)

              harpatil@redhat.com Harshal Patil
              rhn-engineering-dgoodwin Devan Goodwin
              Rahul Gangwar
              Votes: 0
              Watchers: 17
