Red Hat Enterprise Linux AI / RHELAI-2750

A100 systems are not correctly identified by auto detection


    • Approved

      The H100 profiles are not matched by the expected code-to-file-name translation logic, because this card reports as:

      _CudaDeviceProperties(name='NVIDIA H100 80GB HBM3', major=9, minor=0, total_memory=80994MB, multi_processor_count=132)

      There are several H100 variants (columns: PCI device ID, subsystem ID, subsystem vendor ID, name):

      0x2321, 0x1839, 0x10de, "NVIDIA H100 NVL"
      0x2330, 0x16c0, 0x10de, "NVIDIA H100 80GB HBM3"
      0x2330, 0x16c1, 0x10de, "NVIDIA H100 80GB HBM3"
      0x2331, 0x1626, 0x10de, "NVIDIA H100 PCIe"
      0x2339, 0x17fc, 0x10de, "NVIDIA H100"
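The variant names above all contain the token "H100", so one way to make detection robust is to match on that token rather than on the full reported name. The following is a minimal sketch of that idea, not the project's actual detection code: `gpu_family`, `profile_file_name`, and the `<family>_x<count>.yaml` naming scheme are hypothetical and used only for illustration.

```python
import re
from typing import Optional

# Families mentioned in this issue; extend as needed (assumption, not the project's list).
_FAMILY_RE = re.compile(r"\b(A100|H100)\b")


def gpu_family(device_name: str) -> Optional[str]:
    """Return the family token (e.g. "H100") found in a reported device name, or None."""
    match = _FAMILY_RE.search(device_name.upper())
    return match.group(1) if match else None


def profile_file_name(device_name: str, gpu_count: int) -> Optional[str]:
    """Build a profile file name from the family token and GPU count.

    The "<family>_x<count>.yaml" pattern is a hypothetical naming scheme.
    """
    family = gpu_family(device_name)
    if family is None:
        return None
    return f"{family.lower()}_x{gpu_count}.yaml"
```

With this approach, "NVIDIA H100 NVL", "NVIDIA H100 80GB HBM3", "NVIDIA H100 PCIe", and "NVIDIA H100" all resolve to the same family, so a new marketing suffix would not break the lookup.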

      A full list of all NVIDIA card identifiers can be found in the open drivers (scroll past the nulls):
      https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/generated/g_nv_name_released.h
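To collect every reported name for a given family, entries in the four-column form quoted above can be parsed with a short script. This is a sketch under assumptions: the `GpuEntry` field names are illustrative labels, `parse_entries` and `names_containing` are hypothetical helpers, and the regular expression targets the entry layout shown in this issue rather than the exact syntax of the header file.

```python
import re
from typing import List, NamedTuple


class GpuEntry(NamedTuple):
    device_id: int            # first column (assumed: PCI device ID)
    subsystem_id: int         # second column (assumed: subsystem ID)
    subsystem_vendor_id: int  # third column (0x10de is NVIDIA's PCI vendor ID)
    name: str                 # quoted device name


# Matches: 0x2330, 0x16c0, 0x10de, "NVIDIA H100 80GB HBM3"
_ENTRY_RE = re.compile(
    r'(0x[0-9A-Fa-f]+),\s*(0x[0-9A-Fa-f]+),\s*(0x[0-9A-Fa-f]+),\s*"([^"]+)"'
)


def parse_entries(text: str) -> List[GpuEntry]:
    """Extract every four-column entry found in the given text."""
    return [
        GpuEntry(int(dev, 16), int(sub, 16), int(vendor, 16), name)
        for dev, sub, vendor, name in _ENTRY_RE.findall(text)
    ]


def names_containing(entries: List[GpuEntry], token: str) -> List[str]:
    """Return the sorted, de-duplicated names that contain the token."""
    return sorted({entry.name for entry in entries if token in entry.name})


# The H100 entries quoted in this issue.
SAMPLE = '''
0x2321, 0x1839, 0x10de, "NVIDIA H100 NVL"
0x2330, 0x16c0, 0x10de, "NVIDIA H100 80GB HBM3"
0x2330, 0x16c1, 0x10de, "NVIDIA H100 80GB HBM3"
0x2331, 0x1626, 0x10de, "NVIDIA H100 PCIe"
0x2339, 0x17fc, 0x10de, "NVIDIA H100"
'''

h100_names = names_containing(parse_entries(SAMPLE), "H100")
```

Running the same helpers over the downloaded header text with the token "A100" would give the set of names an A100 profile match needs to cover.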

              cdoern@redhat.com Charles Doern
              cvultur@redhat.com Constantin Daniel Vultur
              Nathan Weinberg
              Constantin Daniel Vultur