Bug
Resolution: Done-Errata
Priority: Major
Affects Version: 4.13.0
Severity: Important
Sprint: MON Sprint 238
Description of problem:
Under heavy control plane load (bringing up ~200 pods), prometheus/promtail spikes to over 100% CPU, and node_exporter goes to ~200% CPU and stays there for 5-10 minutes. Tested on a GCP cluster-bot cluster with 2-physical-core (4 vCPU) workers. This starves essential platform functions like OVS of CPU and causes the data plane to go down. Running perf against node_exporter shows the application spends the majority of its CPU listing new interfaces being added in sysfs. This appears to be a result of disabling netlink via https://issues.redhat.com/browse/OCPBUGS-8282: the sysfs reads grab the rtnl lock, which competes with other components on the host that are trying to configure networking.
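For illustration, a minimal sketch of what re-enabling the netlink path for the netclass collector could look like in a node-exporter DaemonSet. The flag name and manifest layout here are assumptions based on the upstream node_exporter collector and the OCPBUGS-8282 change, not the actual cluster-monitoring-operator configuration:

# Hypothetical fragment, not the shipped openshift-monitoring manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: openshift-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.5.0
        args:
        # Assumption: gather interface stats over netlink instead of reading
        # every attribute file under /sys/class/net, which is the code path
        # that takes the rtnl lock in the perf trace attached below.
        - --collector.netclass.netlink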
Version-Release number of selected component (if applicable):
Tested on 4.13 and 4.14 with GCP.
How reproducible:
3/4 times
Steps to Reproduce:
1. Launch gcp with cluster bot.
2. Create a deployment with pause containers which will max out pods on the nodes:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  namespace: openshift-ovn-kubernetes
  labels:
    pod-name: server
    app: nginx
    role: webserver
spec:
  replicas: 700
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        role: webserver
    spec:
      containers:
      - name: webserver1
        image: k8s.gcr.io/pause:3.1
        ports:
        - containerPort: 80
          name: serve-80
          protocol: TCP

3. Watch top CPU output. Wait for node_exporter and prometheus to show very high CPU. If this does not happen, proceed to step 4.
4. Delete the deployment and then recreate it.
5. High and persistent CPU usage should now be observed.
Actual results:
CPU is pegged on the host for several minutes and the terminal is almost unresponsive. The only way to fix it was to delete the node_exporter and prometheus DS.
Expected results:
Prometheus and other metrics-related applications should:
1. use netlink to avoid grabbing the rtnl lock, and
2. be CPU limited (see the sketch below).

Certain required applications in OCP, like the networking data plane, are left resource-unbounded to ensure the node's core functions continue to work. Metrics tooling, however, should be CPU limited so that it cannot lock up a node.
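As a rough illustration of point 2, a CPU limit on the node-exporter container could look like the fragment below. The container name and the request/limit values are placeholders for the sketch, not recommended defaults for the cluster-monitoring-operator:

# Hypothetical resources stanza for the node-exporter container spec;
# the numbers are placeholders and would need tuning by the monitoring team.
containers:
- name: node-exporter
  resources:
    requests:
      cpu: 8m
      memory: 32Mi
    limits:
      cpu: 200m      # cap the collector so it cannot starve OVS/ovn-kubernetes
      memory: 128Mi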
Additional info:
Perf summary (will attach full perf output):

99.94%  0.00%  node_exporter  node_exporter  [.] runtime.goexit.abi0
|
---runtime.goexit.abi0
   |
   --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func2
     |
     --99.33%--github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1
       |
       --99.33%--github.com/prometheus/node_exporter/collector.execute
         |
         |--97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).Update
         |          |
         |          --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).netClassSysfsUpdate
         |            |
         |            --97.67%--github.com/prometheus/node_exporter/collector.(*netClassCollector).getNetClassInfo
         |              |
         |              --97.64%--github.com/prometheus/procfs/sysfs.FS.NetClassByIface
         |                |
         |                --97.64%--github.com/prometheus/procfs/sysfs.parseNetClassIface
         |                  |
         |                  --97.61%--github.com/prometheus/procfs/internal/util.SysReadFile
         |                    |
         |                    --97.45%--syscall.read
         |                      |
         |                      --97.45%--syscall.Syscall
         |                        |
         |                        --97.45%--runtime/internal/syscall.Syscall6
         |                          |
         |                          --70.34%--entry_SYSCALL_64_after_hwframe
         |                                    do_syscall_64
         |                                    |
         |                                    |--39.13%--ksys_read
         |                                    |          |
         |                                    |          |--31.97%--vfs_read
is related to: OCPBUGS-11591 Mass sig-network test failures on GCP OVN (Closed)
links to: RHSA-2023:5006 OpenShift Container Platform 4.14.z security update