AI Platform Core Components / AIPCC-740

Inference fails on Intel Gaudi in multi-accelerator configurations

    • Sprint: AIPCC Sprint 2, AIPCC Application Platform 3, AIPCC Application Platform 4

      To Reproduce

      Steps to reproduce the behavior:

      1. Provision a RHEL AI system with Intel Gaudi accelerators
      2. Run [ilab config init]
      3. In the serve section of the generated config, set the GPU count to use all accelerators available in the system (see the sketch after this list)
      4. Run [ilab model serve]
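
      For reference, the step 3 edit lands in the serve section of config.yaml. A minimal sketch, assuming the default vLLM backend layout produced by [ilab config init] (field names can vary between InstructLab releases):

        serve:
          backend: vllm
          vllm:
            # maps to vLLM tensor parallelism; 8 = one shard per HL-325L card
            gpus: 8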

      Expected behavior

      • Model serving starts successfully

      Actual behavior

      • Serving fails with the exception [collective nonSFG is not supported during hpu graph capturing]
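
      Habana's vLLM integration requires PT_HPU_ENABLE_LAZY_COLLECTIVES=true for tensor-parallel inference when HPU Graphs are enabled, which is consistent with both the exception above and the workaround below. A quick diagnostic (a sketch; run it in the environment that hosts the vLLM server process, e.g. inside the InstructLab container):

        env | grep PT_HPU_ENABLE_LAZY_COLLECTIVES \
          || echo "PT_HPU_ENABLE_LAZY_COLLECTIVES is not set"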

      Device Info (please complete the following information):

      • Hardware Specs: 8x HL-325L accelerators
      • OS Version: registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.4 9.20250108.0
      • InstructLab Version: ilab, version 0.23.1
      • Diagnostic command output:
         ---------------------------: System Configuration :---------------------------
        Num CPU Cores : 208
        CPU RAM       : 2113255252 KB
        ------------------------------------------------------------------------------
        Platform:
          sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
          sys.platform: linux
          os.name: posix
          platform.release: 5.14.0-427.50.1.el9_4.x86_64
          platform.machine: x86_64
          platform.node: localhost.localdomain
          platform.python_version: 3.11.7
          os-release.ID: rhel
          os-release.VERSION_ID: 9.4
          os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
          memory.total: 2015.36 GB
          memory.available: 1988.79 GB
          memory.used: 17.80 GB
        
        InstructLab:
          instructlab.version: 0.23.1
          instructlab-dolomite.version: 0.2.0
          instructlab-eval.version: 0.5.1
          instructlab-quantize.version: 0.1.0
          instructlab-schema.version: 0.4.2
          instructlab-sdg.version: 0.7.0
          instructlab-training.version: 0.7.0
        
        Torch:
          torch.version: 2.5.1a0+git6fc067b
          torch.backends.cpu.capability: AVX512
          torch.version.cuda: None
          torch.version.hip: None
          torch.cuda.available: False
          torch.backends.cuda.is_built: False
          torch.backends.mps.is_built: False
          torch.backends.mps.is_available: False
          habana_torch_plugin.version: 1.19.1.26
          torch.hpu.is_available: True
          torch.hpu.device_count: 8
          torch.hpu.0.name: GAUDI3
          torch.hpu.0.capability: 1.19.1.7608d85
          torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.1.name: GAUDI3
          torch.hpu.1.capability: 1.19.1.7608d85
          torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.2.name: GAUDI3
          torch.hpu.2.capability: 1.19.1.7608d85
          torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.3.name: GAUDI3
          torch.hpu.3.capability: 1.19.1.7608d85
          torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.4.name: GAUDI3
          torch.hpu.4.capability: 1.19.1.7608d85
          torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.5.name: GAUDI3
          torch.hpu.5.capability: 1.19.1.7608d85
          torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.6.name: GAUDI3
          torch.hpu.6.capability: 1.19.1.7608d85
          torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          torch.hpu.7.name: GAUDI3
          torch.hpu.7.capability: 1.19.1.7608d85
          torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
          env.HABANA_LOGS: /var/log/habana_logs/
          env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
          env.HABANA_PROFILE: profile_api_light
          env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
          env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true
        
        llama_cpp_python:
          llama_cpp_python.version: 0.3.2
          llama_cpp_python.supports_gpu_offload: False
        
        

      Bug impact

      • [ilab model serve] fails on multi-accelerator Intel Gaudi configurations

      Known workaround

      • Add "--env" "PT_HPU_ENABLE_LAZY_COLLECTIVES=true" to the ilab wrapper (see the sketch below)
