RHEL-31249

ucx library segfaults on Genoa and Sapphire Rapids CPUs with mlx5 IB due to incompatible memmove calls - patches included

    • Bug
    • Resolution: Done
    • Major
    • None
    • rhel-9.1.0, rhel-9.2.0, rhel-9.3.0, rhel-9.4, rhel-9.5
    • ucx
    • ucx-1.16.0-1.el9
    • None
    • Moderate
    • Patch, EasyFix
    • 2
    • rhel-sst-network-drivers
    • ssg_networking
    • 3
    • Dev ack
    • False
    • None
    • Red Hat Enterprise Linux
    • Network Drivers 5, Network Drivers 6
    • None
    • None
    • x86_64
    • None

      What were you trying to do that didn't work?

      The initial issue was OpenMPI 4.1.6, built with ucx, failing while running the sp.D test from the NAS Parallel Benchmarks. The issue was then reproduced with ucx_perftest to simplify debugging.

      root@rschhpc211:~# ucx_perftest -t ucp_am_lat -s `expr 1024 \* 1024` rschhpc210

      root@rschhpc210:~# ucx_perftest
      [1698428074.879303] [rschhpc210:13557:0] perftest.c:899 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
      Waiting for connection...
      Accepted connection from 10.3.8.219:54350
      ----------------------------------------------------------------------------------------------------------

      API: protocol layer
      Test: am latency
      Data layout: (automatic)
      Send memory: host
      Recv memory: host
      Message size: 1048576
      AM header size: 0

      ----------------------------------------------------------------------------------------------------------
      [rschhpc210:13557:0:13557] ib_mlx5_log.c:162 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
      [rschhpc210:13557:0:13557] ib_mlx5_log.c:162 RC QP 0x3177 wqe[60241]: RDMA_READ s-- [rva 0x7fc08799c000 rkey 0x2f1b1] [va 0x7fc4e3f63000 len 1048576 lkey 0x1bdd26] [rqpn 0x102 dlid=33 sl=0 port=1 src_path_bits=0]
      ==== backtrace (tid: 13557) ====
      0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc4e5535fc4]
      1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc4e5536176]
      2 /lib/x86_64-linux-gnu/libucs.so.0(+0x25c9a) [0x7fc4e553ac9a]
      3 /lib/x86_64-linux-gnu/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc4e55344a4]
      4 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5ed) [0x7fc4e509d6fd]
      5 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(+0x3eb16) [0x7fc4e50b9b16]
      6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc4e55ed28a]
      7 ucx_perftest(+0x416de) [0x56329edf56de]
      8 ucx_perftest(+0x1ff92) [0x56329edd3f92]
      9 ucx_perftest(+0x82ea) [0x56329edbc2ea]
      10 ucx_perftest(+0x5a94) [0x56329edb9a94]
      11 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc4e5229d90]
      12 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc4e5229e40]
      13 ucx_perftest(+0x6375) [0x56329edba375]
      =================================
      Aborted (core dumped)

      Please provide the package NVR for which bug is seen:

      How reproducible: ucx_perftest fails fairly reliably (~90% of runs) on Genoa CPUs, but only occasionally on Sapphire Rapids. The OpenMPI sp.D test fails close to 100% of the time on both.

      The issue has been diagnosed and resolved upstream with the help of Nvidia. A portion of the ucx code gets "optimized" by the compiler into memmove calls that do not function properly with EPYC Genoa or Xeon Sapphire Rapids CPUs.  The change in the compiler regarding memmove occurred between gcc 10.3 and 10.4, so RHEL8 with older gcc works without the patches, but RHEL9 with newer gcc does not.
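
      To illustrate the kind of transformation described above, here is a minimal sketch in C. It is not the ucx code and not necessarily what the upstream patches do; the function names copy_words_plain and copy_words_volatile are made up for this example. The first loop is the sort of plain copy that GCC's loop pattern recognition (-ftree-loop-distribute-patterns) may replace with a memcpy/memmove call at higher optimization levels. The second writes through a volatile pointer, which is one common way to force the compiler to emit the stores itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain word-by-word copy. With optimization enabled, GCC is allowed to
 * recognize this loop and replace it with a call to memcpy()/memmove(). */
void copy_words_plain(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Same copy, but every store goes through a volatile-qualified pointer.
 * Volatile accesses must be performed as written, so the compiler cannot
 * fold this loop into a library call. */
void copy_words_volatile(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    volatile uint64_t *vdst = dst;
    for (size_t i = 0; i < n; i++) {
        vdst[i] = src[i];
    }
}
```

      Both functions produce the same result in ordinary memory; the difference is only in the instructions the compiler is permitted to generate. Comparing the assembly output (for example with gcc -O2 -S) for the two functions should show whether the first one was turned into a library call by a given compiler version. Building the affected file with -fno-tree-loop-distribute-patterns is another possible way to suppress the conversion.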

      After I identified that issue, Nvidia patched ucx to prevent that portion of code from being converted to memmove calls. I have successfully used these patches on several clusters already. Please add the patches from these 2 pull requests to fix the packages in RHEL permanently:

      https://github.com/openucx/ucx/pull/9692
      https://github.com/openucx/ucx/pull/9714

              kheib Kamal Heib
              quesar Rick Warner (Inactive)
              Kamal Heib
              Afom Michael