  1. Machine Config Operator
  2. MCO-1094

Impact: Excessive node status updates causing high control plane CPU


    • Type: Spike
    • Resolution: Done
    • Priority: Critical

      This is the impact statement for the OCPBUGS-29713 series:

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      • Any upgrade to versions 4.12.51, 4.13.33 through 4.13.36, 4.14.12 through 4.14.15, and 4.15.0 through 4.15.1
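
The version list above can be turned into a quick check on an update target. The following is a minimal sketch; `AFFECTED_RANGES`, `parse_version`, and `is_affected_target` are hypothetical names, and the ranges are copied only from the list above. A result of `False` means the version is not in that list, not that it is confirmed fixed.

```python
# Affected z-stream ranges for 4.y releases, copied from the list above:
# minor version -> (first affected z, last affected z).
AFFECTED_RANGES = {
    12: (51, 51),
    13: (33, 36),
    14: (12, 15),
    15: (0, 1),
}


def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'X.Y.Z' into integers. Pre-release suffixes are not handled."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_affected_target(version: str) -> bool:
    """Return True if the update target is one of the listed affected versions."""
    major, minor, patch = parse_version(version)
    if major != 4 or minor not in AFFECTED_RANGES:
        return False
    first, last = AFFECTED_RANGES[minor]
    return first <= patch <= last
```

For example, `is_affected_target("4.14.13")` is true, while `is_affected_target("4.14.16")` is false.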

      Which types of clusters?

      • Clusters where the control plane may not tolerate additional load

      What is the impact? Is it serious enough to warrant removing update recommendations?

      • This bug causes nodes to report their status every 15 seconds rather than every 5 minutes, even when nothing has changed
      • In a cluster with 15 total nodes and a nominal workload, this results in an approximately 20-30% increase in API server requests per second
      • If the control plane is not scaled to tolerate that additional load, abnormal behavior may arise. In one observed occurrence, Service Account tokens were not being authorized and the pods requiring them crashlooped
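
As a rough illustration of the arithmetic behind the intervals above, the sketch below computes only the rate of periodic node status reports for the 15-node example. It assumes one report per node per interval and ignores all other API traffic, so it does not reproduce the observed 20-30% figure, which is relative to total API server load. The function name `status_reports_per_second` is hypothetical.

```python
def status_reports_per_second(node_count: int, interval_seconds: float) -> float:
    """Cluster-wide rate of periodic node status reports, one per node per interval."""
    return node_count / interval_seconds


NODES = 15                 # cluster size from the example above
BUGGY_INTERVAL = 15        # seconds, the reverted value
EXPECTED_INTERVAL = 5 * 60 # seconds, the intended 5 minute value

buggy = status_reports_per_second(NODES, BUGGY_INTERVAL)
expected = status_reports_per_second(NODES, EXPECTED_INTERVAL)
ratio = buggy / expected

print(f"buggy: {buggy:.2f}/s, expected: {expected:.2f}/s, ratio: {ratio:.0f}x")
```

Under these assumptions, the reports go from 0.05 per second to 1 per second, a 20-fold increase in that one request type.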

      How involved is remediation?

      • Upgrade to a fixed version. The issue is resolved in 4.12.53, 4.13.37, 4.14.16, and 4.15.2 or later
      • If you do not wish to upgrade, you can apply a custom KubeletConfig. However, because this change also requires a rolling reboot, upgrading is preferred.
        • The kubelet config value to add is `nodeStatusReportFrequency: 5m`
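
One way to produce the custom kubelet config described above is to generate the manifest as JSON. This is a sketch, not a verified procedure: the helper `build_kubelet_config`, the object name, and the choice of the worker pool label are assumptions for illustration, and only the `nodeStatusReportFrequency: 5m` value comes from this issue. Each machine config pool that needs the workaround would need its own selector.

```python
import json


def build_kubelet_config(name: str, pool_label: str) -> dict:
    """Build a KubeletConfig manifest that sets nodeStatusReportFrequency to 5m."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "metadata": {"name": name},
        "spec": {
            # Selects the machine config pool carrying this label (assumed: worker).
            "machineConfigPoolSelector": {"matchLabels": {pool_label: ""}},
            # The workaround value named in this issue.
            "kubeletConfig": {"nodeStatusReportFrequency": "5m"},
        },
    }


manifest = build_kubelet_config(
    "set-node-status-report-frequency",  # hypothetical object name
    "pools.operator.machineconfiguration.openshift.io/worker",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `oc apply -f`; expect the selected pool to roll through a reboot, as noted above.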

      Is this a regression?

      • Yes. A Machine Config Operator change meant to keep configuration changes from triggering unnecessary reboots caused this value to revert to its default of 15s

       

            rhn-support-sdodson Scott Dodson
            trking W. Trevor King