Bug
Resolution: Done
Major
4.11
None
This is a backport from https://issues.redhat.com/browse/OCPBUGS-1044
Description of problem:
Upstream issue: https://github.com/prometheus/node_exporter/issues/2299
The node-exporter pod, when run on a bare metal worker with an AMD EPYC CPU, fails to start and goes into CrashLoopBackOff with the following output:
  State: Waiting
    Reason: CrashLoopBackOff
  Last State: Terminated
    Reason: Error
    Message: 05.145Z caller=node_exporter.go:115 level=info collector=tapestats
      ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=textfile
      ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=thermal_zone
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=time
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=timex
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=udp_queues
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=uname
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=vmstat
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=xfs
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=zfs
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:199 level=info msg="Listening on" address=127.0.0.1:9100
      ts=2022-09-07T20:25:05.146Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
      panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name
This is a known upstream issue (see the GitHub link above) and it has been fixed in a later upstream release.
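For context on the panic above: classic Prometheus metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*, and the name in the panic contains hyphens. The following is a minimal Go sketch of that check, assuming only the standard library. The helpers isValidMetricName and sanitizeMetricName are hypothetical names introduced for illustration; they are not node_exporter functions, and replacing invalid characters with underscores is only one possible workaround, not necessarily how upstream fixed the issue.

```go
package main

import (
	"fmt"
	"regexp"
)

// metricNameRE is the classic Prometheus metric-name pattern.
var metricNameRE = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// invalidChars matches every character that is not allowed in a metric name.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_:]`)

// isValidMetricName reports whether name matches the classic pattern.
// Hypothetical helper, for illustration only.
func isValidMetricName(name string) bool {
	return metricNameRE.MatchString(name)
}

// sanitizeMetricName replaces each invalid character with an underscore and
// prefixes an underscore if the result is empty or starts with a digit.
// Hypothetical helper; an assumption, not the upstream fix.
func sanitizeMetricName(name string) string {
	s := invalidChars.ReplaceAllString(name, "_")
	if s == "" || (s[0] >= '0' && s[0] <= '9') {
		s = "_" + s
	}
	return s
}

func main() {
	// The name taken verbatim from the panic message in this report.
	name := "node_rapl_package-0-die-0_joules_total"
	fmt.Println(isValidMetricName(name))                     // false
	fmt.Println(sanitizeMetricName(name))                    // node_rapl_package_0_die_0_joules_total
	fmt.Println(isValidMetricName(sanitizeMetricName(name))) // true
}
```

The hyphens in the "package-0-die-0" portion of the name are what make it invalid, which is consistent with the crash being seen only on the affected hardware.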
Version-Release number of selected component (if applicable):
4.11.0
How reproducible:
Every time
Steps to Reproduce:
1. Provision a bare metal node that uses an AMD EPYC CPU.
2. Observe that node-exporter pods scheduled on that node crash with the error message shown in the description.
Actual results:
Node-exporter pods cannot run on the new nodes
Expected results:
Node exporter pods should be able to start up and run like on any other node
Additional info:
As mentioned above, this issue was tracked and fixed in a later upstream release of node-exporter: https://github.com/prometheus/node_exporter/issues/2299
Could the fixed version be pulled into 4.11?
- clones
-
OCPBUGS-1044 There's an issue with node-exporter pods running when using a bare metal AMD EPYC setup
- Closed
- depends on
-
OCPBUGS-1044 There's an issue with node-exporter pods running when using a bare metal AMD EPYC setup
- Closed