OpenShift Bugs / OCPBUGS-63490

ARO-HCP: Cluster-api pod CPU usage frequently spikes

    • Quality / Stability / Reliability

      Description of problem:

      During recent performance and scale testing on a 250-node ARO-HCP cluster, the cluster-api-controller-manager pod was observed consuming unusually high CPU even when idle (no guest workload), with a sharp spike every 10 minutes. This causes noisy CPU throttling and potential performance degradation in other control plane components.
      
      The frequent CPU spikes are significantly higher than expected for steady-state control plane operations. In the logs, the spikes correlate with large-scale reconciliation events, and the periodic machine health check appears to trigger them even when no major scaling actions are taking place.
          

      Version-Release number of selected component (if applicable):

      4.19.z
          

      How reproducible:

      Always at this scale
          

      Steps to Reproduce:

          1. Create an ARO-HCP cluster with at least 250 nodes
          2. Watch the container CPU usage of the cluster-api pod in the HCP namespace, using the 'container_cpu_usage_seconds_total' metric
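
The metric in step 2 is a cumulative counter of CPU-seconds, so the number of cores in use over an interval is the difference between two samples divided by the elapsed time. The following is a minimal sketch of that calculation, assuming the samples have already been fetched as (timestamp, value) pairs. The function names and the 2x-median spike threshold are illustrative assumptions, not part of this report.

```python
from statistics import median


def cpu_cores(samples):
    """Convert cumulative CPU-seconds samples into average cores per interval.

    samples: list of (unix_timestamp_seconds, counter_value), sorted by time.
    Returns a list of (interval_end_timestamp, cores).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0 or v1 < v0:
            # Skip out-of-order samples and counter resets (e.g. a container restart).
            continue
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


def find_spikes(rates, factor=2.0):
    """Return the intervals whose usage exceeds factor * median usage.

    The threshold is an arbitrary illustrative choice, not a recommended value.
    """
    if not rates:
        return []
    baseline = median(cores for _, cores in rates)
    return [(t, cores) for t, cores in rates if cores > factor * baseline]
```

Feeding the spike timestamps returned by `find_spikes` into a simple difference of consecutive values would show whether the spikes recur on a fixed period, such as the 10-minute interval described above.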
          

      Actual results:

      Frequent CPU Spikes
          

      Expected results:

      CPU usage comparable to ROSA-HCP (in ROSA-HCP, usage is > a core)
          

      Additional info:

      Link to node-level usage screenshots: https://drive.google.com/drive/folders/1-1l5xhGMrTqJvLfdpV79ExTNmPQoWRA3

      At 500 nodes the consumption is over 14 cores on the shared MC worker.

      Logs show frequent health checks:

      I1022 16:42:44.351263       1 machinehealthcheck_targets.go:326] "Health checking target" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="ocm-arohcpprod-2m38p19lqrvda3v1lr0mn0jo0ecv2fke-aro-250/aro-250-np-static-1" namespace="ocm-arohcpprod-2m38p19lqrvda3v1lr0mn0jo0ecv2fke-aro-250" name="aro-250-np-static-1" reconcileID="f8e75ce6-a6b2-4099-8ec7-be6a6ef94370" Cluster="ocm-arohcpprod-2m38p19lqrvda3v1lr0mn0jo0ecv2fke-aro-250/2m38p19lqrvda3v1lr0mn0jo0ecv2fke" Machine="ocm-arohcpprod-2m38p19lqrvda3v1lr0mn0jo0ecv2fke-aro-250/aro-250-np-static-1-l76tf-xzgpq" Node="aro-250-np-static-1-l76tf-xzgpq"
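
To quantify how often the health check runs, log lines like the one above can be parsed and counted per minute. The following is a rough sketch, assuming every line follows the same klog-style header and key="value" layout as the sample. The helper names `parse_line` and `count_health_checks` are hypothetical.

```python
import re
from collections import Counter

# klog-style header: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
HEADER = re.compile(r'^[IWEF](\d{4}) (\d{2}):(\d{2}):(\d{2})\.\d+\s+\d+\s+(\S+)\]')
# The quoted message that follows the header.
MESSAGE = re.compile(r'\]\s+"([^"]*)"')
# Structured key="value" pairs after the message.
FIELD = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    line = line.strip()
    header = HEADER.match(line)
    message = MESSAGE.search(line)
    if not header or not message:
        return None
    return {
        "minute": f"{header.group(2)}:{header.group(3)}",
        "source": header.group(5),
        "message": message.group(1),
        "fields": dict(FIELD.findall(line)),
    }


def count_health_checks(lines):
    """Count "Health checking target" entries per (minute, MachineHealthCheck)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["message"] == "Health checking target":
            key = (rec["minute"], rec["fields"].get("MachineHealthCheck", ""))
            counts[key] += 1
    return counts
```

Comparing the per-minute counts against the timestamps of the CPU spikes would be one way to check the correlation described in this report.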
      

              Unassigned
              Murali Krishnasamy (mukrishn@redhat.com)
              He Liu