Type: Bug
Resolution: Done
Priority: Normal
Fix Versions: rhelai-1.4, rhelai-1.5
Sprints: AIPCC Sprint 2, AIPCC Application Platform 3, AIPCC Application Platform 4
To Reproduce
Steps to reproduce the behavior:
- Provision a RHEL AI system with Intel Gaudi accelerators
- Run `ilab config init`
- Edit the GPU count in the `serve` section of the config so that all accelerators available in the system are used (see the sketch after these steps)
- Run `ilab model serve`
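For reference, a minimal sketch of the relevant part of `config.yaml`, assuming the vLLM serving backend; the exact key layout (`serve.vllm.gpus`) can vary between ilab versions, so treat the key names as illustrative rather than authoritative:

```yaml
# ~/.config/instructlab/config.yaml (sketch; key names are assumptions)
serve:
  backend: vllm
  vllm:
    # Raised so that all 8 HL-325L accelerators in this system are used
    # for tensor-parallel serving.
    gpus: 8
```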
Expected behavior
- Model serving starts successfully
Actual behavior
- Serving fails with the exception `collective nonSFG is not supported during hpu graph capturing`
Device Info:
- Hardware Specs: 8x HL-325L accelerators
- OS Version: registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.4 9.20250108.0
- InstructLab Version: ilab, version 0.23.1
- System info output:
```
---------------------------: System Configuration :---------------------------
Num CPU Cores : 208
CPU RAM       : 2113255252 KB
------------------------------------------------------------------------------
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.50.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: localhost.localdomain
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 2015.36 GB
  memory.available: 1988.79 GB
  memory.used: 17.80 GB

InstructLab:
  instructlab.version: 0.23.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.0
  instructlab-training.version: 0.7.0

Torch:
  torch.version: 2.5.1a0+git6fc067b
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: None
  torch.cuda.available: False
  torch.backends.cuda.is_built: False
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  habana_torch_plugin.version: 1.19.1.26
  torch.hpu.is_available: True
  torch.hpu.device_count: 8
  torch.hpu.0.name: GAUDI3
  torch.hpu.0.capability: 1.19.1.7608d85
  torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.1.name: GAUDI3
  torch.hpu.1.capability: 1.19.1.7608d85
  torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.2.name: GAUDI3
  torch.hpu.2.capability: 1.19.1.7608d85
  torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.3.name: GAUDI3
  torch.hpu.3.capability: 1.19.1.7608d85
  torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.4.name: GAUDI3
  torch.hpu.4.capability: 1.19.1.7608d85
  torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.5.name: GAUDI3
  torch.hpu.5.capability: 1.19.1.7608d85
  torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.6.name: GAUDI3
  torch.hpu.6.capability: 1.19.1.7608d85
  torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.7.name: GAUDI3
  torch.hpu.7.capability: 1.19.1.7608d85
  torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  env.HABANA_LOGS: /var/log/habana_logs/
  env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
  env.HABANA_PROFILE: profile_api_light
  env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
  env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true

llama_cpp_python:
  llama_cpp_python.version: 0.3.2
  llama_cpp_python.supports_gpu_offload: False
```
Bug impact
- `ilab model serve` fails when configured to use multiple Gaudi accelerators, so the model cannot be served
Known workaround
- Add `--env PT_HPU_ENABLE_LAZY_COLLECTIVES=true` to the `ilab` wrapper script (a sketch follows)
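A minimal sketch of what that wrapper change could look like, assuming the wrapper is a shell script that launches `ilab` inside a container via podman; the image reference, device flag, and overall script structure are placeholders for illustration and not the exact RHEL AI wrapper, only the `--env` line is the actual workaround:

```bash
#!/usr/bin/env bash
# Hypothetical minimal ilab wrapper (sketch). IMAGE and --device are
# placeholders; the --env line below is the workaround from this report.
IMAGE="registry.example.com/instructlab:latest"  # placeholder image reference
exec podman run --rm -it \
  --device=/dev/accel \
  --env PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  "$IMAGE" ilab "$@"
```

With the variable set, `ilab system info` run through the wrapper reports `env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true`, as in the output above. The Gaudi vLLM documentation notes that this setting is required for tensor-parallel inference with HPU graphs enabled, which would explain why serving across all 8 accelerators fails without it.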
Additional context