  OpenShift Bugs / OCPBUGS-63188

PAI PMU PAI_EXT missing on RHCOS 9 workers: perf has no NNPA events; blocks Telum/zDNN validation


    • Moderate
    • s390x
    • Dev

      Description of problem:

      The hardware advertises NNPA and Triton uses Telum zDNN, but the PAI PMU is not present on the host: /sys/bus/event_source/devices/pai_ext is missing, so perf does not register any NNPA events and fails with "Bad event name".
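The mismatch described above (CPU flag present, PMU device absent) can be checked in one step. The following is a minimal sketch, not an established tool: `nnpa_pmu_status` is a hypothetical helper name, and its two optional arguments exist only so the check can be pointed at other paths. The defaults are the paths named in this report, and the sketch assumes that s390x lists CPU facilities on the `features` line of /proc/cpuinfo.

```shell
#!/bin/sh
# Sketch: compare what the CPU advertises with what the kernel exposes.
# nnpa_pmu_status is a hypothetical helper; both arguments are optional.

nnpa_pmu_status() {
    cpuinfo="${1:-/proc/cpuinfo}"
    pmu_dir="${2:-/sys/bus/event_source/devices/pai_ext}"

    # Assumption: the facility list is on the "features" line (s390x).
    if grep -Eq '^features.*[[:space:]]nnpa([[:space:]]|$)' "$cpuinfo" 2>/dev/null; then
        has_nnpa=yes
    else
        has_nnpa=no
    fi

    if [ -d "$pmu_dir" ]; then
        has_pmu=yes
    else
        has_pmu=no
    fi

    echo "nnpa_flag=$has_nnpa pai_ext=$has_pmu"

    # Exit status is 0 only when the CPU flag and the PMU device agree.
    [ "$has_nnpa" = "$has_pmu" ]
}

# Usage on the host (after nsenter): nnpa_pmu_status
```

On a node affected by this bug the function would be expected to report the flag as present and the device as absent, and to return a nonzero status.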

      Version-Release number of selected component (if applicable):

      OCP: 4.19.4
      Worker kernel: 5.14.0-570.26.1.el9_6.s390x
      OS image: RHCOS 9 (node shows 9.6; debug pod base is RHEL 9.4)
      CRI: cri-o 1.32.6
      z/OS zCX
      perf: 5.14.0-427.76.1.el9_4.s390x
      CPU flags include nnpa
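The versions above show a perf build from el9_4 next to a kernel from el9_6. A small sketch for making that comparison explicit follows; `same_kernel_build` is a hypothetical helper, not part of perf or rpm. Whether the mismatch matters here is not established by this report: the missing sysfs device is registered by the kernel rather than by the perf tool, so the comparison is only useful for ruling the userspace side in or out.

```shell
#!/bin/sh
# Sketch: flag a perf build that does not match the running kernel.
# same_kernel_build is a hypothetical helper name.

same_kernel_build() {
    # $1: kernel release, e.g. the output of `uname -r`
    # $2: perf package version-release.arch
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel=$1 perf=$2"
        return 1
    fi
}

# Usage inside the debug shell:
#   same_kernel_build "$(uname -r)" \
#       "$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' perf)"
```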

      How reproducible:

          Always

      Steps to Reproduce:

          1. Deploy Triton Inference Server using an ONNX-MLIR/zDNN model on s390x workers and send inference requests.
          2. oc debug node/<worker>; nsenter -a -t 1
          3. Run:
             ls -ld /sys/bus/event_source/devices/pai_ext
             perf list | grep -E '(^| )pai_ext/|^ *NNPA_'
             perf stat -a -e NNPA_ALL sleep 10
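The commands in step 3 can be wrapped so that each one runs even if an earlier one fails and its exit status is recorded, which makes the output easier to attach to the ticket. This is a sketch only: `run_step`, `collect_pai_report`, and the report layout are hypothetical names chosen for illustration, while the commands themselves are the ones from the steps above.

```shell
#!/bin/sh
# Sketch: run each step-3 command, continue on failure, record exit codes.
# run_step and collect_pai_report are hypothetical helper names.

run_step() {
    # $1: label; remaining arguments: the command to run
    label=$1
    shift
    echo "== $label: $*"
    # Capture the status without aborting if the caller uses `set -e`.
    "$@" 2>&1 && rc=0 || rc=$?
    echo "-- exit status: $rc"
    return 0
}

collect_pai_report() {
    run_step sysfs ls -ld /sys/bus/event_source/devices/pai_ext
    run_step list sh -c "perf list | grep -E '(^| )pai_ext/|^ *NNPA_'"
    run_step stat perf stat -a -e NNPA_ALL sleep 10
}

# Usage on the host (after nsenter): collect_pai_report > pai_ext_report.txt
```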

      Actual results:

      - /sys/bus/event_source/devices/pai_ext is missing
      - perf list shows no NNPA events
      - perf ... -e NNPA_* → Bad event name

      Expected results:

      - pai_ext device exists on host
      - perf list shows NNPA_* events (e.g., NNPA_ALL, NNPA_MATMUL_OP, NNPA_ADD)
      - perf -e NNPA_* returns nonzero counts under NNPA load
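Once a fix is available, the second expectation could be verified mechanically by counting matching lines in the `perf list` output. The sketch below reuses the grep pattern from the reproduction steps; `count_nnpa_events` is a hypothetical helper name, and the exact layout of `perf list` output is not assumed beyond what that pattern matches.

```shell
#!/bin/sh
# Sketch: count NNPA-related lines in `perf list` output read from stdin.
# count_nnpa_events is a hypothetical helper name.

count_nnpa_events() {
    # grep -c prints the number of matching lines (0 when none match);
    # `|| true` keeps the exit status at 0 in the no-match case.
    grep -cE '(^| )pai_ext/|^ *NNPA_' || true
}

# Usage on the host: perf list | count_nnpa_events
```

A result of 0 corresponds to the actual results reported here; any positive count would indicate that perf now sees NNPA events.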

      Additional info:

          

              Unassigned Unassigned
              brian.rubenstein@ibm.com Brian Rubenstein