AI Platform Core Components / AIPCC-740

Inference fails on Intel Gaudi in multi-accelerator configurations

    • Sprint: AIPCC Sprint 2, AIPCC Application Platform 3, AIPCC Application Platform 4

      To Reproduce

      Steps to reproduce the behavior:

      1. Provision a RHEL AI system with Intel Gaudi accelerators
      2. Run [ilab config init]
      3. In the serve section of the generated config, set the GPU count to use all accelerators available in the system (see the sketch after this list)
      4. Run [ilab model serve]
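
      For reference, the step 3 edit lands in the serve section of config.yaml. A minimal sketch, assuming the default vLLM backend layout produced by [ilab config init] (field names can vary between InstructLab releases):

        serve:
          backend: vllm
          vllm:
            # maps to vLLM tensor parallelism; 8 = one shard per HL-325L card
            gpus: 8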

      Expected behavior

      • Model serving starts successfully

      Actual behavior

      • Serving fails with the exception [collective nonSFG is not supported during hpu graph capturing]
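
      Habana's vLLM integration requires PT_HPU_ENABLE_LAZY_COLLECTIVES=true for tensor-parallel inference when HPU Graphs are enabled, which is consistent with both the exception above and the workaround below. A quick diagnostic (a sketch; run it in the environment that hosts the vLLM server process, e.g. inside the InstructLab container):

        env | grep PT_HPU_ENABLE_LAZY_COLLECTIVES \
          || echo "PT_HPU_ENABLE_LAZY_COLLECTIVES is not set"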

      Device Info (please complete the following information):

      • Hardware Specs: 8x HL-325L accelerators
      • OS Version: registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.4 9.20250108.0
      • InstructLab Version: ilab, version 0.23.1
      • Diagnostic command output:
         ---------------------------: System Configuration :---------------------------
        Num CPU Cores : 208
        CPU RAM       : 2113255252 KB
        ------------------------------------------------------------------------------
        Platform:
          sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
          sys.platform: linux
          os.name: posix
          platform.release: 5.14.0-427.50.1.el9_4.x86_64
          platform.machine: x86_64
          platform.node: localhost.localdomain
          platform.python_version: 3.11.7
          os-release.ID: rhel
          os-release.VERSION_ID: 9.4
          os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
          memory.total: 2015.36 GB
          memory.available: 1988.79 GB
          memory.used: 17.80 GB
        
        InstructLab:
          instructlab.version: 0.23.1
          instructlab-dolomite.version: 0.2.0
          instructlab-eval.version: 0.5.1
          instructlab-quantize.version: 0.1.0
          instructlab-schema.version: 0.4.2
          instructlab-sdg.version: 0.7.0
          instructlab-training.version: 0.7.0
        
        Torch:
          torch.version: 2.5.1a0+git6fc067b
          torch.backends.cpu.capability: AVX512
          torch.version.cuda: None
          torch.version.hip: None
          torch.cuda.available: False
          torch.backends.cuda.is_built: False
          torch.backends.mps.is_built: False
          torch.backends.mps.is_available: False
          habana_torch_plugin.version: 1.19.1.26
          torch.hpu.is_available: True
          torch.hpu.device_count: 8
          torch.hpu.0.name: GAUDI3
          torch.hpu.0.capability: 1.19.1.7608d85
          torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.1.name: GAUDI3
          torch.hpu.1.capability: 1.19.1.7608d85
          torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.2.name: GAUDI3
          torch.hpu.2.capability: 1.19.1.7608d85
          torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.3.name: GAUDI3
          torch.hpu.3.capability: 1.19.1.7608d85
          torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.4.name: GAUDI3
          torch.hpu.4.capability: 1.19.1.7608d85
          torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.5.name: GAUDI3
          torch.hpu.5.capability: 1.19.1.7608d85
          torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.6.name: GAUDI3
          torch.hpu.6.capability: 1.19.1.7608d85
          torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.7.name: GAUDI3
          torch.hpu.7.capability: 1.19.1.7608d85
          torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          env.HABANA_LOGS: /var/log/habana_logs/
          env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
          env.HABANA_PROFILE: profile_api_light
          env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
          env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true
        
        llama_cpp_python:
          llama_cpp_python.version: 0.3.2
          llama_cpp_python.supports_gpu_offload: False
        
        

      Bug impact

      • [ilab model serve] fails on multi-accelerator Intel Gaudi configurations

      Known workaround

      • Add "--env" "PT_HPU_ENABLE_LAZY_COLLECTIVES=true" to the ilab wrapper (see the sketch below)
