OCPBUGS-62727 (OpenShift Bugs)

node-exporter throws errors reading the InfiniBand class counter info



      Description of problem:

      The "node-exporter" pod logs are full of error messages showing that the infiniband collector cannot read the InfiniBand counters:

      2025-09-30T17:07:12.432674275+05:30 ts=2025-09-30T11:37:12.432Z caller=collector.go:169 level=error msg="collector failed" name=infiniband duration_seconds=0.006340497 err="error obtaining InfiniBand class info: failed to read file \"/host/sys/class/infiniband/qedr0/ports/1/counters/VL15_dropped\": invalid argument"
      

      This error is already fixed upstream; see https://github.com/prometheus/node_exporter/issues/3265.

      Version-Release number of selected component (if applicable):

      OpenShift 4.18.22

      How reproducible:

      Always

      Steps to Reproduce:

      1. Deploy OpenShift on hardware with InfiniBand devices
      2. Review the "node-exporter" container logs
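
To see which counter files are affected on a node, each file under the InfiniBand sysfs tree can be read in turn. The following is a minimal sketch, not part of the original report: the function name `probe_ib_counters` is hypothetical, and it assumes that a plain `cat` of the file fails in the same way as the read performed by the collector.

```shell
# probe_ib_counters [ROOT]
# Tries to read every file under ROOT/<device>/ports/<port>/counters/ and
# prints one line per file that cannot be read.
# ROOT defaults to /sys/class/infiniband (hypothetical helper, for illustration).
# Returns 1 if at least one read failed, 0 otherwise.
probe_ib_counters() {
  root="${1:-/sys/class/infiniband}"
  failed=0
  for f in "$root"/*/ports/*/counters/*; do
    [ -e "$f" ] || continue              # the glob matched nothing
    # Capture only stderr of cat; the counter value itself is discarded.
    if ! err=$(cat "$f" 2>&1 >/dev/null); then
      echo "FAIL $f ($err)"
      failed=1
    fi
  done
  return "$failed"
}
```

On a node, `probe_ib_counters /sys/class/infiniband` would be the natural call; inside the node-exporter container the same tree appears under `/host/sys/class/infiniband`, as shown in the error message.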

      Actual results:

      1. No counter metrics are collected for InfiniBand
      2. Each "node-exporter" pod logs the error at a high rate: 358 occurrences in roughly 45 minutes for one pod, and 2148 across all "node-exporter" pods
        $ oc logs node-exporter-c8jxr  -c node-exporter -n openshift-monitoring|head -1
        2025-09-30T16:22:57.923447493+05:30 ts=2025-09-30T10:52:57+00:00 num_cpus=96 gomaxprocs=4
        
        $ oc logs node-exporter-c8jxr  -c node-exporter -n openshift-monitoring |tail -1
        2025-09-30T17:07:42.429386262+05:30 ts=2025-09-30T11:37:42.429Z caller=collector.go:169 level=error msg="collector failed" name=infiniband duration_seconds=0.000256877 err="error obtaining InfiniBand class info: failed to read file \"/host/sys/class/infiniband/qedr0/ports/1/counters/VL15_dropped\": invalid argument"
        
        $ oc logs node-exporter-c8jxr  -c node-exporter -n openshift-monitoring|grep -c "error obtaining InfiniBand"
        358
        
        $ for pod in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=node-exporter -o name); do oc logs -n openshift-monitoring "$pod" -c node-exporter; done | grep -c "error obtaining InfiniBand"
        2148
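
The counts above do not show whether a single counter file or several are involved. A small filter over the same log output can break the errors down per file. This is a sketch only: the function name `summarize_ib_errors` is hypothetical, and it assumes the log lines keep the format shown in this report.

```shell
# summarize_ib_errors: reads node-exporter log lines on stdin and prints
# "<count> <sysfs path>" for every file named in an InfiniBand read failure.
# (hypothetical helper, for illustration)
summarize_ib_errors() {
  grep "error obtaining InfiniBand" \
    | grep -o '/host/sys/class/infiniband/[^\\"]*' \
    | sort | uniq -c | awk '{print $1, $2}'
}
```

It can be fed from the same command used above, for example `oc logs node-exporter-c8jxr -c node-exporter -n openshift-monitoring | summarize_ib_errors`.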
        

      Expected results:

      1. InfiniBand counters are collected as metrics
      2. No noisy errors in the "node-exporter" pod logs
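
Both expectations can be checked against the metrics text exposed by a node-exporter instance. The sketch below assumes the standard node_exporter metric names (`node_scrape_collector_success` with a `collector` label, and the `node_infiniband_` prefix). The function name `check_ib_metrics` is hypothetical, and how the `/metrics` output is fetched is left open because it depends on the environment.

```shell
# check_ib_metrics: reads Prometheus text-format metrics on stdin.
# Succeeds only if the infiniband collector reports success and at least
# one node_infiniband_* sample is present. (hypothetical helper)
check_ib_metrics() {
  metrics=$(cat)
  printf '%s\n' "$metrics" \
    | grep -q '^node_scrape_collector_success{collector="infiniband"} 1' || return 1
  printf '%s\n' "$metrics" | grep -q '^node_infiniband_' || return 1
}
```

A non-zero exit status means the collector is still failing or no InfiniBand series are exported.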

      Additional info:

              prasriva@redhat.com Pranshu Srivastava
              rhn-support-ocasalsa Oscar Casal Sanchez
              Junqi Zhao Junqi Zhao