OpenShift Request For Enhancement: RFE-4714

adjust HighOverallControlPlaneCPU alert for SNO

    • Type: Feature Request
    • Resolution: Done
    • Priority: Normal
    • openshift-4.12
    • kube-apiserver

      1. Proposed title of this feature request

      Adjust the HighOverallControlPlaneCPU alert for SNO and clarify the recommended overall CPU and memory resource usage for SNO deployments. For a deployment with 3 master nodes, the recommendation is 60%, and the HighOverallControlPlaneCPU alert is set accordingly.

      2. What is the nature and description of the request?

      As per our documentation on "Control plane node sizing", for larger clusters we recommend keeping the overall CPU and memory resource usage on the control plane nodes to a maximum of 60%, which is aligned with the HighOverallControlPlaneCPU alert. This guideline ensures that the system can handle sudden spikes in resource usage and prevents potential cascading failures, but it applies to clusters with 3 master nodes. The statement from our documentation reads: "To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes."
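      To make the 60% figure concrete: with three control plane nodes of capacity C each, total capacity is 3C. If overall usage is U% of total capacity and one node fails, the two survivors must each run at 3U/2 percent of their own capacity, so U = 60% puts them at 90%, and U ≈ 66.7% (2/3) saturates them. On SNO there is no surviving node that must absorb the load, so this 2/3 reasoning does not carry over.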

      However, the Prometheus alert "HighOverallControlPlaneCPU" is triggered based on the same 60% threshold even in SNO deployments, where one might expect a different threshold. This can be verified as follows:

      $ oc get infrastructure cluster -o jsonpath='{.status.infrastructureTopology}'
      SingleReplica

      $ oc get prometheusrules cpu-utilization -n openshift-kube-apiserver -o yaml | grep -A 20 HighOverallControlPlaneCPU
      ...

      - alert: HighOverallControlPlaneCPU
        annotations:
          description: Given three control plane nodes, the overall CPU utilization
            may only be about 2/3 of all available capacity. This is because if a single
            control plane node fails, the remaining two must handle the load of the
            cluster in order to be HA. If the cluster is using more than 2/3 of all
            capacity, if one control plane node fails, the remaining two are likely
            to fail when they take the load. To fix this, increase the CPU and memory
            on your control plane nodes.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-apiserver-operator/ExtremelyHighIndividualControlPlaneCPU.md
          summary: CPU utilization across all three control plane nodes is higher than
            two control plane nodes can sustain; a single control plane node outage
            may cause a cascading failure; increase available CPU.
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          /
          count(kube_node_role{role="master"})
          > 60
      ...
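      One way to address this (a sketch only, based on the expression above; the 90% SNO threshold and the HighOverallControlPlaneCPUSingleNode name are assumptions, not the shipped change) would be to gate the existing rule on a multi-node control plane and add a separate single-node rule:

      # Sketch: fire the 60% rule only when there is more than one control plane node.
      - alert: HighOverallControlPlaneCPU
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          /
          count(kube_node_role{role="master"})
          > 60
          and on () count(kube_node_role{role="master"}) > 1

      # Sketch: a hypothetical SNO-specific rule with a higher, assumed threshold,
      # since there is no surviving node whose headroom must be reserved.
      - alert: HighOverallControlPlaneCPUSingleNode
        expr: |
          sum(
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
            AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
          )
          > 90
          and on () count(kube_node_role{role="master"}) == 1

      Both variants keep the existing query shape and only change when a rule can fire, so multi-node clusters would behave exactly as before.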

      3. Why does the customer need this? (List the business requirements here)

      As a customer, we require a recommendation for maximum CPU/memory usage when conducting load testing on SNO, as we have for deployments with 3 master nodes.

      4. List any affected packages or components.

      cluster-kube-apiserver-operator

      https://github.com/openshift/cluster-kube-apiserver-operator/blob/a66fe80f9ce087436b7bdd2e05af512892864664/bindata/assets/alerts/cpu-utilization.yaml#L10

            dfroehli42rh Daniel Fröhlich
            rhn-support-jclaretm Jorge Claret Membrado