OpenShift Bugs / OCPBUGS-52169

Wrong workload partitioning resource accounting can render up to 2 CPUs unusable on MNO worker nodes

    • Bug
    • Resolution: Done
    • Normal
    • 4.20.0
    • 4.19.z
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Done
    • Release Note Not Required
    • N/A

      Description of the problem

      On a multi-node OpenShift cluster with workload partitioning, the two pods coredns-<worker-node-hostname> and keepalived-<worker-node-hostname> in namespace openshift-kni-infra account for 200m of CPU Requests each, although they are scheduled on reserved cores. When Kubernetes' CPU Manager policy is static (e.g. when a PerformanceProfile is defined), one core therefore becomes unavailable for Guaranteed QoS pods. When the CPU Manager policy option full-pcpus-only is set (again, configured automatically when a PerformanceProfile is defined), Guaranteed QoS pods cannot be scheduled on that core's hyper-threading sibling either. Effectively, this renders 2 CPUs (1 full physical CPU) unusable for Guaranteed QoS pods.

      The relevant code sections and pods have existed since at least OpenShift 4.12, but I have only verified this with OpenShift's latest development branch, release-4.19.

      Single-node OpenShift (SNO) is not affected because neither pod exists on SNO.
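
      A quick way to see the inflated node-level accounting is to compare the node's allocatable CPU with the summed CPU Requests that oc describe node reports (a minimal sketch, assuming a worker node named w0 as in the reproduction below):

      # Allocatable CPU as reported for the node
      oc get node w0 -o jsonpath='{.status.allocatable.cpu}{"\n"}'

      # Summed CPU Requests, including the 2x 200m from coredns-w0 and keepalived-w0
      oc describe node w0 | grep -A 8 'Allocated resources'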

      How to reproduce

      Take an OCP cluster on bare metal with 3x control plane nodes (cp0, ..., cp2) and 2x worker nodes (w0, w1) that has been deployed from OpenShift's latest development branch release-4.19 with workload partitioning enabled. Apply a PerformanceProfile to the worker nodes. For example, the following snippet assumes that each worker node has 6 full cores (no HT):

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: ovs-dpdk-worker
        annotations:
          kubeletconfig.experimental: |
            cpuManagerPolicyOptions: {"full-pcpus-only": "true", "strict-cpu-reservation": "true"}
      spec:
        ...
        cpu:
          isolated: "2-5"
          reserved: "0-1"
        nodeSelector:
          node-role.kubernetes.io/worker: ''
        numa:
          topologyPolicy: "restricted"
        realTimeKernel:
          enabled: true
      
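      Optionally, one can confirm that the rendered kubelet configuration on the worker picked up the static CPU Manager policy, the policy options and the reserved set (a rough sketch; /etc/kubernetes/kubelet.conf is assumed to be the path where the Machine Config Operator renders the kubelet configuration):

      oc debug node/w0 -- chroot /host \
        grep -E 'cpuManagerPolicy|full-pcpus-only|reservedSystemCPUs' /etc/kubernetes/kubelet.conf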

      Once the PerformanceProfile has been applied and all cluster operators have finished processing, analyse the resource reports for the worker nodes. This is how CPU Requests for worker node w0 are calculated in practice:

      $> oc describe node w0
      ...
        Namespace             Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------             ----           ------------  ----------  ---------------  -------------  ---
      ...
        openshift-kni-infra   coredns-w0     200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         19h
        openshift-kni-infra   keepalived-w0  200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         3d19h
      ...
      
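      The 200m per pod comes from the containers' original cpu requests, which, as described under "Root cause(s)" below, should have been replaced by the management.workload.openshift.io/cores resource. This can be cross-checked from the API (a sketch, using the same pod names):

      # CPU requests are still present on the unmodified pod
      oc get pod -n openshift-kni-infra coredns-w0 \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'

      # For comparison, a pod the kubelet did modify (see the kubelet log excerpt further below)
      oc get pod -n openshift-machine-config-operator kube-rbac-proxy-crio-w0 \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'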

      However, both pods are scheduled on reserved cores. The following steps show that pod coredns-w0 and its processes run on cores 0-1, which are part of the reserved set of CPUs:

      [root@w0 core]# crictl ps | grep coredns
      08875670e3c73       261456df2f8b144878394e8bb4e460b8a034de77fd110d5bdf08b64675d1600b                                                                                         47 hours ago        Running             coredns-monitor                         4                   1ce095743e88d       coredns-w0                                               openshift-kni-infra
      144ee3febff08       055b0d93f9148beb67a65e87eb5f9ce1be830f92f7b9a39d4234a42231c18100                                                                                         47 hours ago        Running             coredns                                 4                   1ce095743e88d       coredns-w0                                               openshift-kni-infra
      
      [root@w0 core]# crictl inspect -o go-template --template '{{.info.runtimeSpec.linux.cgroupsPath}}' 144ee3febff08 
      kubepods-burstable-podaea8671fc62a7f24df6a3fc3f95ae686.slice:crio:144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
      
      [root@w0 core]# systemctl list-units | grep 144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
        crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope                                                                                                 loaded active running   libcrun container
        crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope                                                                                          loaded active running   crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope
      
      [root@w0 core]# systemctl show crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope --property EffectiveCPUs 
      EffectiveCPUs=0-1
      
      [root@w0 core]# crictl inspect -o go-template --template '{{.info.pid}}' 144ee3febff08 
      3212
      
      [root@w0 core]# taskset -a -c -p 3212
      pid 3212's current affinity list: 0,1
      pid 3240's current affinity list: 0,1
      pid 3242's current affinity list: 0,1
      pid 3243's current affinity list: 0,1
      pid 3244's current affinity list: 0,1
      pid 3266's current affinity list: 0,1
      pid 3267's current affinity list: 0,1
      pid 3280's current affinity list: 0,1
      pid 8873's current affinity list: 0,1
      pid 29450's current affinity list: 0,1
      pid 369624's current affinity list: 0,1
      
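      The same check can be repeated for keepalived-w0. A compact variant, run on the worker node and reusing the commands above (the awk/sed extraction assumes the crictl output format shown here):

      # Pick the main keepalived container, extract its full ID from the cgroup path,
      # then query the effective cpuset of its crio scope
      CID=$(crictl ps | awk '/ keepalived /{print $1; exit}')
      FULL=$(crictl inspect -o go-template --template '{{.info.runtimeSpec.linux.cgroupsPath}}' "$CID" | sed 's/.*crio://')
      systemctl show "crio-${FULL}.scope" --property EffectiveCPUs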

      This means CPU Requests for worker node w0 should instead be calculated like this:

      $> oc describe node w0
      ...
        Namespace             Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------             ----           ------------  ----------  ---------------  -------------  ---
      ...
        openshift-kni-infra   coredns-w0     0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         19h
        openshift-kni-infra   keepalived-w0  0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         3d19h
      ...
      

      Also, notice the following Kubelet error messages on worker node w0:

      [root@w0 core]# journalctl -u kubelet.service -xb
      ...
      Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.505135    3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-coredns does not have Resource.Requests" name="coredns-w0" namespace="openshift-kni-infra"
      Feb 27 11:42:07 w0 kubenswrapper[3124]: I0227 11:42:07.506414    3124 file.go:238] "Static Pod is managed. Using modified pod" name="kube-rbac-proxy-crio-w0" namespace="openshift-machine-config-operator" annotations={"kubernetes.io/config.hash":"bf9d35ec8358a627e1050a20dcafce29","openshift.io/required-scc":"privileged","resources.workload.openshift.io/kube-rbac-proxy-crio":"{\"cpushares\":20}","resources.workload.openshift.io/setup":"{\"cpushares\":5}","target.workload.openshift.io/management":"{\"effect\": \"PreferredDuringScheduling\"}"}
      Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.507791    3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-keepalived does not have Resource.Requests" name="keepalived-w0" namespace="openshift-kni-infra"
      ...
      
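      The errors for render-config-coredns and render-config-keepalived suggest that the kubelet did not apply the workload partitioning rewrite to these two pods, so their API objects also lack the resources.workload.openshift.io/... annotations that the successfully modified kube-rbac-proxy-crio-w0 carries. A rough way to cross-check this (a sketch, using the same pod names):

      oc get pod -n openshift-kni-infra coredns-w0 -o json | grep 'resources.workload.openshift.io' || echo "no workload annotations"
      oc get pod -n openshift-machine-config-operator kube-rbac-proxy-crio-w0 -o json | grep 'resources.workload.openshift.io'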

      This points us to the root cause(s).

      Root cause(s)

      Both pods coredns-<worker-node-hostname> (coredns.yaml) and keepalived-<worker-node-hostname> (keepalived.yaml) define an initContainers field with containers render-config-coredns and render-config-keepalived. These two init containers do not request any resources:

      kind: Pod
      apiVersion: v1
      metadata:
        name: coredns
      ...
      spec:
      ...
        initContainers:
        - name: render-config-coredns
      ...
          resources: {}
      ...
        containers:
        - name: coredns
      ...
      

      Note that both pods are static pods installed by the Machine Config Operator.
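
      Since they are static pods, the manifests can also be inspected directly on the node (a sketch; /etc/kubernetes/manifests is assumed to be the kubelet's static pod path, and the -A context may need adjusting):

      oc debug node/w0 -- chroot /host \
        grep -A 20 'render-config' /etc/kubernetes/manifests/coredns.yaml /etc/kubernetes/manifests/keepalived.yaml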

      When a PerformanceProfile with reserved cores has been defined, the Cluster Node Tuning Operator writes the file /etc/kubernetes/openshift-workload-pinning. The kubelet then modifies static pods; in particular, it rewrites their resources fields. For example, it changes a resources field from

          resources:
            requests:
              cpu: 100m
              memory: 200Mi
      

      to

          resources:
            limits:
              management.workload.openshift.io/cores: "100"
            requests:
              management.workload.openshift.io/cores: "100"
              memory: 200Mi
      
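      For completeness, a successfully managed pod additionally gets one annotation per container recording the original CPU request as CPU shares, which is what the kubelet log above shows for kube-rbac-proxy-crio-w0. A sketch of the combined result for the 100m example (container name and values illustrative):

      metadata:
        annotations:
          # 100m CPU corresponds to roughly 102 CPU shares (1024 shares per CPU)
          resources.workload.openshift.io/<container-name>: '{"cpushares": 102}'
      spec:
        containers:
        - name: <container-name>
          resources:
            limits:
              management.workload.openshift.io/cores: "100"
            requests:
              management.workload.openshift.io/cores: "100"
              memory: 200Mi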

      However, if resources.requests is not defined (correctly), the code bails out without applying any modifications, an error is logged, and the pod's resources are accounted incorrectly.

      Note that non-static pods are handled by a different code path in Kubernetes (explanation). For those non-static pods, the code simply does not try to modify resources.requests if it has not been specified.

      Potential solution(s)

      Populate the resources fields of both init containers render-config-coredns and render-config-keepalived properly. For example, in coredns.yaml and keepalived.yaml change from

              resources: {}
      

      to

              resources:
                requests:
                  cpu: 100m
                  memory: 200Mi
      

      In addition, consider aligning the resource modification behavior for static and non-static pods; in particular, do not fail at lines 119 to 127 in kubelet/managed/managed.go. However, it should be clarified with the original authors why Kubernetes behaves differently in the two cases.

      Acknowledgements

      Thanks to msivak@redhat.com and fromani@redhat.com for their guidance and for generously sharing their knowledge during the days spent debugging this.

              msivak@redhat.com Martin Sivak
              jmeng@redhat.com Jakob Meng
              Sergio Regidor de la Rosa