OpenShift Bugs / OCPBUGS-52169

Wrong workload partitioning resource accounting can render up to 2 CPUs unusable on MNO worker nodes

    • Bug
    • Resolution: Done
    • Normal
    • 4.20.0
    • 4.19.z
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Done
    • Release Note Not Required
    • N/A

      Description of the problem

      On a multi-node OpenShift cluster with workload partitioning, the two pods coredns-<worker-node-hostname> and keepalived-<worker-node-hostname> in namespace openshift-kni-infra account for 200m of CPU Requests each, although they are scheduled on reserved cores. When Kubernetes' CPU Manager policy is static (e.g. when a PerformanceProfile is defined), one core therefore becomes unavailable for Guaranteed QoS pods. When the CPU Manager policy option full-pcpus-only is set (again, configured automatically when a PerformanceProfile is defined), Guaranteed QoS pods cannot be scheduled on that core's hyper-threading sibling either. Effectively, this renders 2 CPUs (1 full physical CPU) unusable for Guaranteed QoS pods.

      The relevant code sections and pods have existed since at least OpenShift 4.12, but I have only verified this with OpenShift's latest development branch, release-4.19.

      Single-node OpenShift (SNO) is not affected because neither pod exists on SNO.
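
      A quick way to see the inflated node-level accounting is to compare the node's allocatable CPU with the summed CPU Requests that oc describe node reports (a minimal sketch, assuming a worker node named w0 as in the reproduction below):

      # Allocatable CPU as reported for the node
      oc get node w0 -o jsonpath='{.status.allocatable.cpu}{"\n"}'

      # Summed CPU Requests, including the 2x 200m from coredns-w0 and keepalived-w0
      oc describe node w0 | grep -A 8 'Allocated resources'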

      How to reproduce

      Take an OCP cluster on bare metal with 3x control plane nodes (cp0, ..., cp2) and 2x worker nodes (w0, w1) that has been deployed from OpenShift's latest development branch release-4.19 with workload partitioning enabled. Apply a PerformanceProfile to the worker nodes. For example, the following snippet assumes that each worker node has 6 full cores (no HT):

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: ovs-dpdk-worker
        annotations:
          kubeletconfig.experimental: |
            cpuManagerPolicyOptions: {"full-pcpus-only": "true", "strict-cpu-reservation": "true"}
      spec:
        ...
        cpu:
          isolated: "2-5"
          reserved: "0-1"
        nodeSelector:
          node-role.kubernetes.io/worker: ''
        numa:
          topologyPolicy: "restricted"
        realTimeKernel:
          enabled: true
      
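      Optionally, one can confirm that the rendered kubelet configuration on the worker picked up the static CPU Manager policy, the policy options and the reserved set (a rough sketch; /etc/kubernetes/kubelet.conf is assumed to be the path where the Machine Config Operator renders the kubelet configuration):

      oc debug node/w0 -- chroot /host \
        grep -E 'cpuManagerPolicy|full-pcpus-only|reservedSystemCPUs' /etc/kubernetes/kubelet.conf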

      Once the PerformanceProfile has been applied and all cluster operators have finished processing, analyse the resource reports for the worker nodes. This is how CPU Requests for worker node w0 are calculated in practice:

      $> oc describe node w0
      ...
        Namespace             Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------             ----           ------------  ----------  ---------------  -------------  ---
      ...
        openshift-kni-infra   coredns-w0     200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         19h
        openshift-kni-infra   keepalived-w0  200m (6%)     0 (0%)      400Mi (2%)       0 (0%)         3d19h
      ...
      
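      The 200m per pod comes from the containers' original cpu requests, which, as described under "Root cause(s)" below, should have been replaced by the management.workload.openshift.io/cores resource. This can be cross-checked from the API (a sketch, using the same pod names):

      # CPU requests are still present on the unmodified pod
      oc get pod -n openshift-kni-infra coredns-w0 \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'

      # For comparison, a pod the kubelet did modify (see the kubelet log excerpt further below)
      oc get pod -n openshift-machine-config-operator kube-rbac-proxy-crio-w0 \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'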

      However, both pods are scheduled on reserved cores. The following steps show that pod coredns-w0 and its processes run on cores 0-1, which are part of the reserved set of CPUs:

      [root@w0 core]# crictl ps | grep coredns
      08875670e3c73       261456df2f8b144878394e8bb4e460b8a034de77fd110d5bdf08b64675d1600b                                                                                         47 hours ago        Running             coredns-monitor                         4                   1ce095743e88d       coredns-w0                                               openshift-kni-infra
      144ee3febff08       055b0d93f9148beb67a65e87eb5f9ce1be830f92f7b9a39d4234a42231c18100                                                                                         47 hours ago        Running             coredns                                 4                   1ce095743e88d       coredns-w0                                               openshift-kni-infra
      
      [root@w0 core]# crictl inspect -o go-template --template '{{.info.runtimeSpec.linux.cgroupsPath}}' 144ee3febff08 
      kubepods-burstable-podaea8671fc62a7f24df6a3fc3f95ae686.slice:crio:144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
      
      [root@w0 core]# systemctl list-units | grep 144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39
        crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope                                                                                                 loaded active running   libcrun container
        crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope                                                                                          loaded active running   crio-conmon-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope
      
      [root@w0 core]# systemctl show crio-144ee3febff08720f7d2d23de947ecc9cb5d6e076019c3c489af22da302cfd39.scope --property EffectiveCPUs 
      EffectiveCPUs=0-1
      
      [root@w0 core]# crictl inspect -o go-template --template '{{.info.pid}}' 144ee3febff08 
      3212
      
      [root@w0 core]# taskset -a -c -p 3212
      pid 3212's current affinity list: 0,1
      pid 3240's current affinity list: 0,1
      pid 3242's current affinity list: 0,1
      pid 3243's current affinity list: 0,1
      pid 3244's current affinity list: 0,1
      pid 3266's current affinity list: 0,1
      pid 3267's current affinity list: 0,1
      pid 3280's current affinity list: 0,1
      pid 8873's current affinity list: 0,1
      pid 29450's current affinity list: 0,1
      pid 369624's current affinity list: 0,1
      
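      The same check can be repeated for keepalived-w0. A compact variant, run on the worker node and reusing the commands above (the awk/sed extraction assumes the crictl output format shown here):

      # Pick the main keepalived container, extract its full ID from the cgroup path,
      # then query the effective cpuset of its crio scope
      CID=$(crictl ps | awk '/ keepalived /{print $1; exit}')
      FULL=$(crictl inspect -o go-template --template '{{.info.runtimeSpec.linux.cgroupsPath}}' "$CID" | sed 's/.*crio://')
      systemctl show "crio-${FULL}.scope" --property EffectiveCPUs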

      This means CPU Requests for worker node w0 should instead be calculated like this:

      $> oc describe node w0
      ...
        Namespace             Name           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------             ----           ------------  ----------  ---------------  -------------  ---
      ...
        openshift-kni-infra   coredns-w0     0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         19h
        openshift-kni-infra   keepalived-w0  0 (0%)        0 (0%)      400Mi (2%)       0 (0%)         3d19h
      ...
      

      Also, notice the following Kubelet error messages on worker node w0:

      [root@w0 core]# journalctl -u kubelet.service -xb
      ...
      Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.505135    3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-coredns does not have Resource.Requests" name="coredns-w0" namespace="openshift-kni-infra"
      Feb 27 11:42:07 w0 kubenswrapper[3124]: I0227 11:42:07.506414    3124 file.go:238] "Static Pod is managed. Using modified pod" name="kube-rbac-proxy-crio-w0" namespace="openshift-machine-config-operator" annotations={"kubernetes.io/config.hash":"bf9d35ec8358a627e1050a20dcafce29","openshift.io/required-scc":"privileged","resources.workload.openshift.io/kube-rbac-proxy-crio":"{\"cpushares\":20}","resources.workload.openshift.io/setup":"{\"cpushares\":5}","target.workload.openshift.io/management":"{\"effect\": \"PreferredDuringScheduling\"}"}
      Feb 27 11:42:07 w0 kubenswrapper[3124]: E0227 11:42:07.507791    3124 file.go:236] "Static Pod is managed but errored" err="managed container render-config-keepalived does not have Resource.Requests" name="keepalived-w0" namespace="openshift-kni-infra"
      ...
      
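      The errors for render-config-coredns and render-config-keepalived suggest that the kubelet did not apply the workload partitioning rewrite to these two pods, so their API objects also lack the resources.workload.openshift.io/... annotations that the successfully modified kube-rbac-proxy-crio-w0 carries. A rough way to cross-check this (a sketch, using the same pod names):

      oc get pod -n openshift-kni-infra coredns-w0 -o json | grep 'resources.workload.openshift.io' || echo "no workload annotations"
      oc get pod -n openshift-machine-config-operator kube-rbac-proxy-crio-w0 -o json | grep 'resources.workload.openshift.io'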

      This points us to the root cause(s).

      Root cause(s)

      Both pods coredns-<worker-node-hostname> (coredns.yaml) and keepalived-<worker-node-hostname> (keepalived.yaml) define an initContainers field with containers render-config-coredns and render-config-keepalived. These two init containers do not request any resources:

      kind: Pod
      apiVersion: v1
      metadata:
        name: coredns
      ...
      spec:
      ...
        initContainers:
        - name: render-config-coredns
      ...
          resources: {}
      ...
        containers:
        - name: coredns
      ...
      

      Note that both pods are static pods installed by the Machine Config Operator.
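
      Since they are static pods, the manifests can also be inspected directly on the node (a sketch; /etc/kubernetes/manifests is assumed to be the kubelet's static pod path, and the -A context may need adjusting):

      oc debug node/w0 -- chroot /host \
        grep -A 20 'render-config' /etc/kubernetes/manifests/coredns.yaml /etc/kubernetes/manifests/keepalived.yaml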

      When a PerformanceProfile with reserved cores has been defined, the Cluster Node Tuning Operator writes the file /etc/kubernetes/openshift-workload-pinning. The kubelet then modifies static pods; in particular, it rewrites their resources fields. For example, it changes a resources field from

          resources:
            requests:
              cpu: 100m
              memory: 200Mi
      

      to

          resources:
            limits:
              management.workload.openshift.io/cores: "100"
            requests:
              management.workload.openshift.io/cores: "100"
              memory: 200Mi
      
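      For completeness, a successfully managed pod additionally gets one annotation per container recording the original CPU request as CPU shares, which is what the kubelet log above shows for kube-rbac-proxy-crio-w0. A sketch of the combined result for the 100m example (container name and values illustrative):

      metadata:
        annotations:
          # 100m CPU corresponds to roughly 102 CPU shares (1024 shares per CPU)
          resources.workload.openshift.io/<container-name>: '{"cpushares": 102}'
      spec:
        containers:
        - name: <container-name>
          resources:
            limits:
              management.workload.openshift.io/cores: "100"
            requests:
              management.workload.openshift.io/cores: "100"
              memory: 200Mi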

      However, if resources.requests is not defined (correctly), the code bails out without applying any modifications, an error is logged, and the pod's resources are accounted incorrectly.

      Note that non-static pods are handled by a different code path in Kubernetes (explanation). For those non-static pods, the code simply does not try to modify resources.requests if it has not been specified.

      Potential solution(s)

      Populate the resources fields of both init containers render-config-coredns and render-config-keepalived properly. For example, in coredns.yaml and keepalived.yaml change from

              resources: {}
      

      to

              resources:
                requests:
                  cpu: 100m
                  memory: 200Mi
      

      In addition, consider aligning the resource modification behavior for static and non-static pods; in particular, do not fail at lines 119 to 127 in kubelet/managed/managed.go. However, it should be clarified with the original authors why Kubernetes behaves differently in the two cases.

      Acknowledgements

      Thanks to msivak@redhat.com and fromani@redhat.com for their guidance and for generously sharing their knowledge during the days spent debugging this.

              msivak@redhat.com Martin Sivak
              jmeng@redhat.com Jakob Meng
              Sergio Regidor de la Rosa