OCPBUGS-1723

Node-exporter on a bare metal AMD EPYC setup (backport to 4.11)

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 4.11
    • 4.11
    • Monitoring
    • None
    • ?
    • Important
    • None
    • Proposed
    • False
      This is a backport from https://issues.redhat.com/browse/OCPBUGS-1044

       

      Description of problem:

      https://github.com/prometheus/node_exporter/issues/2299
      
      The node exporter pod, when run on a bare metal worker with an AMD EPYC CPU, fails to start and crash-loops with the following error message:

          State:       Waiting
            Reason:    CrashLoopBackOff
          Last State:  Terminated
            Reason:    Error
            Message:   05.145Z caller=node_exporter.go:115 level=info collector=tapestats
      ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=textfile
      ts=2022-09-07T20:25:05.145Z caller=node_exporter.go:115 level=info collector=thermal_zone
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=time
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=timex
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=udp_queues
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=uname
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=vmstat
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=xfs
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:115 level=info collector=zfs
      ts=2022-09-07T20:25:05.146Z caller=node_exporter.go:199 level=info msg="Listening on" address=127.0.0.1:9100
      ts=2022-09-07T20:25:05.146Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
      panic: "node_rapl_package-0-die-0_joules_total" is not a valid metric name
      
      This is a known issue (see the GitHub link above) and was fixed in a later upstream release.
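For context on the panic: the classic Prometheus data model only accepts metric names matching `[a-zA-Z_:][a-zA-Z0-9_:]*`, so the hyphens in `package-0-die-0` make the generated name invalid. The sketch below is illustrative only and is not the upstream fix, which may differ. It checks a name against that pattern and shows one possible way to sanitize it. The helper names `is_valid_metric_name` and `sanitize_metric_name` are hypothetical.

```python
import re

# Metric-name pattern from the classic Prometheus data model.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the classic pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def sanitize_metric_name(name: str) -> str:
    """Hypothetical helper: replace disallowed characters with '_'.

    This is one possible approach, assumed for illustration; it is not
    taken from the node_exporter source.
    """
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # A metric name must not start with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


if __name__ == "__main__":
    bad = "node_rapl_package-0-die-0_joules_total"
    print(is_valid_metric_name(bad))
    print(sanitize_metric_name(bad))
```

Run against the name from the panic, the check fails because of the hyphens, and the sanitized form passes the same check.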
      
      

      Version-Release number of selected component (if applicable):

      4.11.0
      
      

      How reproducible:

      Every time
      
      

      Steps to Reproduce:

      1. Provision a bare metal node using an AMD EPYC CPU
      2. Observe that node-exporter pods scheduled on that node crash with the error message shown in the description
      
      

      Actual results:

      Node-exporter pods cannot run on the new nodes 
      
      

      Expected results:

      Node exporter pods should be able to start up and run like on any other node
      
      

      Additional info:

      As mentioned above, this issue was tracked and fixed in a later upstream release of node-exporter
      
      https://github.com/prometheus/node_exporter/issues/2299
      
      Would we be able to get the fixed version pulled into 4.11?
      
      

              hasun@redhat.com Haoyu Sun
              rhn-support-cruhm Courtney Ruhm
              Junqi Zhao Junqi Zhao