
OCPBUGS-29797: Excessive node status updates causing high control plane CPU


    Sprint: MCO Sprint 249, MCO Sprint 250
      * Previously, the `nodeStatusReportFrequency` was linked to the `nodeStatusUpdateFrequency`. With this release, the `nodeStatusReportFrequency` is set to 5 minutes. (link:https://issues.redhat.com/browse/OCPBUGS-29797[*OCPBUGS-29797*])

      When the MCO changed the default nodeStatusUpdateFrequency from 0s to 10s, it inadvertently increased the nodeStatusReportFrequency as well, because the logic that applies the higher 5m default is tied to nodeStatusUpdateFrequency being 0s. We need to set nodeStatusReportFrequency to 5m explicitly.
    Release Note Type: Bug Fix
    Status: In Progress

      This is a clone of issue OCPBUGS-29713. The following is the description of the original issue:

      Description of problem:

      OCPBUGS-29424 revealed that setting the node status update frequency in the kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage.

      The reason is that the increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that react to node changes (API server, etcd, PDB guard pod controllers, and any other static-pod-based machinery).

      Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes the CPU usage to drop immediately.
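
      For context, a minimal Go sketch of the defaulting interplay described above; the KubeletTimings type and applyDefaults function are illustrative stand-ins and not the actual kubelet or MCO code:

      package main

      import (
          "fmt"
          "time"
      )

      // KubeletTimings is a simplified stand-in for the two relevant kubelet
      // configuration fields.
      type KubeletTimings struct {
          NodeStatusUpdateFrequency time.Duration // how often the kubelet computes node status
          NodeStatusReportFrequency time.Duration // how often it PATCHes the status when nothing changed
      }

      // applyDefaults mimics the defaulting behaviour this bug describes: the 5m
      // report default only applies while the update frequency is unset (0s);
      // otherwise the report frequency follows the update frequency.
      func applyDefaults(t *KubeletTimings) {
          if t.NodeStatusReportFrequency == 0 {
              if t.NodeStatusUpdateFrequency == 0 {
                  t.NodeStatusReportFrequency = 5 * time.Minute
              } else {
                  // Linking the two is what drove the report frequency down to 10s
                  // once the MCO started setting the update frequency.
                  t.NodeStatusReportFrequency = t.NodeStatusUpdateFrequency
              }
          }
      }

      func main() {
          // Before the fix: only the update frequency is set.
          broken := KubeletTimings{NodeStatusUpdateFrequency: 10 * time.Second}
          applyDefaults(&broken)
          fmt.Println("report frequency without explicit value:", broken.NodeStatusReportFrequency) // 10s

          // The fix: pin the report frequency to 5m explicitly.
          fixed := KubeletTimings{
              NodeStatusUpdateFrequency: 10 * time.Second,
              NodeStatusReportFrequency: 5 * time.Minute,
          }
          applyDefaults(&fixed)
          fmt.Println("report frequency with explicit value:", fixed.NodeStatusReportFrequency) // 5m
      }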

      Version-Release number of selected component (if applicable):

      Any version that OCPBUGS-15583 was backported to; 4.16 down to 4.11, as far as I understand.

      How reproducible:

      always    

      Steps to Reproduce:

      1. Create a cluster that contains a fix for OCPBUGS-15583.
      2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); these should show abnormally high values for pod/configmap GETs.
          Alternatively, the rate of node status updates is increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])); see the query sketch after these steps.
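      The reproduction hinges on that PromQL query; below is a minimal Go sketch for evaluating it against the cluster's Prometheus/Thanos query endpoint. The PROM_URL and PROM_TOKEN environment variables are assumptions for the sketch (for example, the thanos-querier route and a token from `oc whoami -t`), not part of the steps above:

      package main

      import (
          "crypto/tls"
          "encoding/json"
          "fmt"
          "net/http"
          "net/url"
          "os"
      )

      func main() {
          // Assumed endpoint and token; substitute your cluster's query route and a valid bearer token.
          promURL := os.Getenv("PROM_URL") // e.g. https://thanos-querier-openshift-monitoring.apps.<cluster>
          token := os.Getenv("PROM_TOKEN") // e.g. the output of `oc whoami -t`

          query := `rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])`

          req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
          if err != nil {
              panic(err)
          }
          req.Header.Set("Authorization", "Bearer "+token)

          // Test clusters often use self-signed router certificates; skip verification for this sketch only.
          client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
          resp, err := client.Do(req)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var result map[string]interface{}
          if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
              panic(err)
          }
          // On an affected cluster each node shows roughly one PATCH every 10s (~0.1/s);
          // a healthy cluster reports an unchanged status far less often.
          fmt.Printf("%v\n", result["data"])
      }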

      Actual results:

      The node status is updated every 10s, which causes high CPU usage on control plane operators and the apiserver.

      Expected results:

      The node status should not be updated that frequently, and control plane CPU usage should drop back to normal.

      Additional info:

      slack thread with the node team:
      https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
          
