Bug
Resolution: Done-Errata
Priority: Major
Affects Version: 4.13.0
Severity: Important
Sprint: MON Sprint 238
Description of problem:
Under heavy control plane load (bringing up ~200 pods), prometheus/promtail spikes to over 100% CPU, and node_exporter goes to ~200% CPU and stays there for 5-10 minutes. Tested on a GCP cluster-bot cluster with 2-physical-core (4 vCPU) workers. This starves essential platform functions like OVS of CPU and causes the data plane to go down. Running perf against node_exporter shows the application spends the majority of its CPU listing new interfaces being added in sysfs. This appears to be a result of disabling netlink via https://issues.redhat.com/browse/OCPBUGS-8282: the sysfs reads grab the rtnl lock, which competes with other components on the host that are trying to configure networking.
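For illustration, a minimal sketch of what re-enabling the netlink path for the netclass collector could look like in a node-exporter DaemonSet. The flag name and manifest layout here are assumptions based on the upstream node_exporter collector and the OCPBUGS-8282 change, not the actual cluster-monitoring-operator configuration:

# Hypothetical fragment, not the shipped openshift-monitoring manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: openshift-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.5.0
        args:
        # Assumption: gather interface stats over netlink instead of reading
        # every attribute file under /sys/class/net, which is the code path
        # that takes the rtnl lock in the perf trace attached below.
        - --collector.netclass.netlink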
Version-Release number of selected component (if applicable):
Tested on 4.13 and 4.14 with GCP.
How reproducible:
3/4 times
Steps to Reproduce:
1. Launch gcp with cluster bot.
2. Create a deployment with pause containers which will max out pods on the nodes:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  namespace: openshift-ovn-kubernetes
  labels:
    pod-name: server
    app: nginx
    role: webserver
spec:
  replicas: 700
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        role: webserver
    spec:
      containers:
      - name: webserver1
        image: k8s.gcr.io/pause:3.1
        ports:
        - containerPort: 80
          name: serve-80
          protocol: TCP

3. Watch top CPU output. Wait for node_exporter and prometheus to show very high CPU. If this does not happen, proceed to step 4.
4. Delete the deployment and then recreate it.
5. High and persistent CPU usage should now be observed.
Actual results:
CPU is pegged on the host for several minutes and the terminal is almost unresponsive. The only way to fix it was to delete the node_exporter and prometheus DS.
Expected results:
Prometheus and other metrics-related applications should:
1. use netlink to avoid grabbing the rtnl lock, and
2. be CPU limited (see the sketch below).

Certain required applications in OCP, like the networking data plane, are left resource-unbounded to ensure the node's core functions continue to work. Metrics tooling, however, should be CPU limited so that it cannot lock up a node.
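As a rough illustration of point 2, a CPU limit on the node-exporter container could look like the fragment below. The container name and the request/limit values are placeholders for the sketch, not recommended defaults for the cluster-monitoring-operator:

# Hypothetical resources stanza for the node-exporter container spec;
# the numbers are placeholders and would need tuning by the monitoring team.
containers:
- name: node-exporter
  resources:
    requests:
      cpu: 8m
      memory: 32Mi
    limits:
      cpu: 200m      # cap the collector so it cannot starve OVS/ovn-kubernetes
      memory: 128Mi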
Additional info:
Perf summary (will attach full perf output):

99.94%  0.00%  node_exporter  node_exporter  [.] runtime.goexit.abi0
|
---runtime.goexit.abi0
   |
   --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func2
     |
     --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1
       |
       --99.33%--github.com/prometheus/node_exporter/collector.execute
         |
         |--97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).Update
         |          |
         |          --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).netClassSysfsUpdate
         |            |
         |            --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).getNetClassInfo
         |              |
         |              --97.64%--github.com/prometheus/procfs/sysfs.FS.NetClassByIface
         |                |
         |                --97.64%--github.com/prometheus/procfs/sysfs.parseNetClassIface
         |                  |
         |                  --97.61%--github.com/prometheus/procfs/internal/util.SysReadFile
         |                    |
         |                    --97.45%--syscall.read
         |                      |
         |                      --97.45%--syscall.Syscall
         |                        |
         |                        --97.45%--runtime/internal/syscall.Syscall6
         |                          |
         |                          --70.34%--entry_SYSCALL_64_after_hwframe
         |                                    do_syscall_64
         |                                    |
         |                                    |--39.13%--ksys_read
         |                                    |          |
         |                                    |          |--31.97%--vfs_read
is related to: OCPBUGS-11591 Mass sig-network test failures on GCP OVN (Closed)
links to: RHSA-2023:5006 OpenShift Container Platform 4.14.z security update