Type: Bug
Resolution: Done
Priority: Normal
Affects Version: 4.19.z
Component: Quality / Stability / Reliability
Release Note Type: Release Note Not Required
Description of the problem
On a multi-node OpenShift cluster with workload partitioning, the two pods coredns-<worker-node-hostname> and keepalived-<worker-node-hostname> in namespace openshift-kni-infra account for 200m of CPU Requests each, although they are scheduled on reserved cores. When the Kubernetes CPU Manager policy is static (e.g. when a PerformanceProfile is defined), one core is therefore unavailable for Guaranteed QoS pods. When the CPU Manager policy option full-pcpus-only is set (again, configured automatically when a PerformanceProfile is defined), Guaranteed QoS pods cannot be scheduled on that core's hyperthreading sibling either. Effectively this renders 2 CPUs (1 full physical CPU) unusable for Guaranteed QoS pods.
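For reference, a core's hyperthreading sibling can be read from standard Linux sysfs on the node; a quick sketch (the CPU number and the sibling pairing shown are hypothetical and depend on the machine's topology):

[root@w0 core]# cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list
2,8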
The relevant code sections and pods have existed since at least OpenShift 4.12, but I only verified against OpenShift's latest development branch, release-4.19.
Single-node OpenShift (SNO) is not affected because neither pod exists on SNO.
How to reproduce
Take an OCP cluster on bare metal with 3x control plane nodes (cp0, ..., cp2) and 2x worker nodes (w0, w1) that has been deployed from OpenShift's latest development branch release-4.19 and with Workload Partitioning. Apply a PerformanceProfile to the worker nodes. For example, the following snippet assumes that each worker node has 6 full cores (no HT):
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: ovs-dpdk-worker
  annotations:
    kubeletconfig.experimental: |
      cpuManagerPolicyOptions: {"full-pcpus-only": true, "strict-cpu-reservation": "true"}
spec:
  ...
  cpu:
    isolated: "2-5"
    reserved: "0-1"
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    enabled: true
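To roll the profile out, apply it and wait for the worker MachineConfigPool to finish updating; a minimal sketch (the file name performanceprofile.yaml is an assumption):

$> oc apply -f performanceprofile.yaml
$> oc wait mcp/worker --for=condition=Updated --timeout=30m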
Once the profile has been applied and all clusteroperators have settled, inspect the resource reports for the worker nodes. This is how CPU Requests for worker node w0 are calculated in practice:
$> oc describe node w0
...
  Namespace            Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------            ----           ------------  ----------  ---------------  -------------  ---
  ...
  openshift-kni-infra  coredns-w0     200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         19h
  openshift-kni-infra  keepalived-w0  200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         3d19h
  ...
However, both pods are scheduled on reserved cores. The following steps show how to verify that pod coredns-w0 and its processes are pinned to cores 0-1, which are part of the reserved set of CPUs:
[root@w0 core]# crictl ps | grep coredns
08875670e3c73   261456df2f8b144878394e8bb4e460b8a034de77fd110d5bdf08b64675d1600b   47 hours ago   Running   coredns-monitor   4   1ce095743e88d   coredns-w0   openshift-kni-infra
144ee3febff08   055b0d93f9148beb67a65e87eb5f9ce1be830f92f7b9a39d4234a42231c18100   47 hours ago   Running   coredns           4   1ce095743e88d   coredns-w0   openshift-kni-infra
[root@w0 core]# crictl inspect -o go-template --template '{{.info.runtimeSpec.linux.cgroupsPath}}' 144ee3febff08
kubepods-burstable-podaea8671fc62a7f24df6a3fc3f95ae686.slice:crio:144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
[root@w0 core]# systemctl list-units | grep 144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
  crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope          loaded active running   libcrun container
  crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope   loaded active running   crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope
[root@w0 core]# systemctl show crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope --property EffectiveCPUs
EffectiveCPUs=0-1
[root@w0 core]# crictl inspect -o go-template --template '{{.info.pid}}' 144ee3febff08
3212
[root@w0 core]# taskset -a -c -p 3212
pid 3212's current affinity list: 0,1
pid 3240's current affinity list: 0,1
pid 3242's current affinity list: 0,1
pid 3243's current affinity list: 0,1
pid 3244's current affinity list: 0,1
pid 3266's current affinity list: 0,1
pid 3267's current affinity list: 0,1
pid 3280's current affinity list: 0,1
pid 8873's current affinity list: 0,1
pid 29450's current affinity list: 0,1
pid 369624's current affinity list: 0,1
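For convenience, the individual steps above can be condensed into a hypothetical one-liner on the node (selecting the container by name is an assumption; pick the container ID manually if several match):

[root@w0 core]# CID=$(crictl ps -q --name coredns | head -n1)
[root@w0 core]# PID=$(crictl inspect -o go-template --template '{{.info.pid}}' "$CID")
[root@w0 core]# taskset -a -c -p "$PID"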
This means CPU Requests for worker node w0 should instead be calculated like this:
$> oc describe node w0
...
  Namespace            Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------            ----           ------------  ----------  ---------------  -------------  ---
  ...
  openshift-kni-infra  coredns-w0     0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         19h
  openshift-kni-infra  keepalived-w0  0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         3d19h
  ...
Also, notice the following Kubelet error messages on worker node w0:
[root@w0 core]# journalctl -u kubelet.service -xb
...
Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.505135 3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-coredns does not have Resource.Requests" name="coredns-w0" namespace="openshift-kni-infra"
Feb 27 11:42:07 w0 kubenswrapper[3124]: I0227 11:42:07.506414 3124 file.go:238] "Static Pod is managed. Using modified pod" name="kube-rbac-proxy-crio-w0" namespace="openshift-machine-config-operator" annotations={"kubernetes.io/config.hash":"bf9d35ec8358a627e1050a20dcafce29","openshift.io/required-scc":"privileged","resources.workload.openshift.io/kube-rbac-proxy-crio":"{\"cpushares\":20}","resources.workload.openshift.io/setup":"{\"cpushares\":5}","target.workload.openshift.io/management":"{\"effect\": \"PreferredDuringScheduling\"}"}
Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.507791 3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-keepalived does not have Resource.Requests" name="keepalived-w0" namespace="openshift-kni-infra"
...
These messages point us to the root cause(s).
Root cause(s)
Both pods coredns-<worker-node-hostname> (coredns.yaml) and keepalived-<worker-node-hostname> (keepalived.yaml) define an initContainers field with containers render-config-coredns and render-config-keepalived. These two initContainers do not request resources:
kind: Pod
apiVersion: v1
metadata:
  name: coredns
  ...
spec:
  ...
  initContainers:
  - name: render-config-coredns
    ...
    resources: {}
  ...
  containers:
  - name: coredns
    ...
Note that both pods are static pods which are installed by the Machine Config Operator.
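Since they are static pods, the rendered manifests can also be inspected directly on the node; a sketch, assuming the conventional static pod manifest path (an assumption, adjust as needed):

[root@w0 core]# grep -n 'resources' /etc/kubernetes/manifests/coredns.yaml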
When a PerformanceProfile with reserved cores has been defined, the Cluster Node Tuning Operator writes a file /etc/kubernetes/openshift-workload-pinning. The Kubelet then modifies static pods; in particular, it alters the resources field. For example, it will change the resources field from
resources:
  requests:
    cpu: 100m
    memory: 200Mi
to
resources:
  limits:
    management.workload.openshift.io/cores: "100"
  requests:
    management.workload.openshift.io/cores: "100"
    memory: 200Mi
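On the node, the pinning file can be checked directly; a sketch, where the content shown is an assumption derived from the reserved set 0-1 of the profile above:

[root@w0 core]# cat /etc/kubernetes/openshift-workload-pinning
{"management": {"cpuset": "0-1"}}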
However, if resources.requests is not defined (correctly), the code bails out without applying any modifications: an error is logged and the pod's resources are accounted incorrectly.
Note that non-static pods are handled by a different code path in Kubernetes (explanation). For those non-static pods, the code simply does not try to modify resources.requests if it has not been specified.
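To enumerate all static pods affected by this bail-out on a node, the kubelet log lines shown above can be filtered; a minimal sketch (the extraction assumes the exact log format shown earlier):

[root@w0 core]# journalctl -u kubelet.service -b | grep 'Static Pod is managed but errored' | grep -o 'name="[^"]*"' | sort -u
name="coredns-w0"
name="keepalived-w0"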
Potential solution(s)
Populate the resources fields for both initContainers render-config-coredns and render-config-keepalived properly. For example, in coredns.yaml and keepalived.yaml change from
resources: {}
to
resources:
  requests:
    cpu: 100m
    memory: 200Mi
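Once the fix has rolled out, the Kubelet should rewrite these requests to management.workload.openshift.io/cores as shown above, and the CPU Requests column for both pods should drop to 0; a quick hypothetical check:

$> oc describe node w0 | grep -E 'coredns-w0|keepalived-w0'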
In addition, consider synchronizing the resource-modification behavior for static pods and non-static pods; in particular, do not fail at lines 119 to 127 in kubelet/managed/managed.go. However, it should be clarified with the original authors why Kubernetes behaves differently in the two cases.
Acknowledgements
Thanks to msivak@redhat.com and fromani@redhat.com for their guidance and for generously sharing their knowledge during the days spent debugging this.