Red Hat Enterprise Linux AI / RHELAI-2555

RHELAI 1.3 Intel: Installation from ISO does not configure network


    • Critical
    • Proposed

      To Reproduce

      Steps to reproduce the behavior:

      1. Boot a bare-metal server or a VM from the ISO.
      2. Configure the root user, an additional user, network, gateway, DNS, and hostname in the installer.
      3. Install the system.
      4. After the installation, the network is down and the hostname is not configured.
      5. Investigating the NetworkManager service shows error messages related to D-Bus:

        Dec 05 11:26:39 localhost NetworkManager[5287]: <info>  [1733397999.2333] NetworkManager (version 1.46.0-20.el9_4) is starting... (after a restart, boot:d7f8>
        Dec 05 11:26:39 localhost NetworkManager[5287]: <info>  [1733397999.2333] Read config: /etc/NetworkManager/NetworkManager.conf
        Dec 05 11:26:39 localhost NetworkManager[5287]: <error> [1733397999.2336] bus-manager: cannot connect to D-Bus: Could not connect: No such file or directory
        Dec 05 11:26:39 localhost NetworkManager[5287]: <info>  [1733397999.2336] exiting (error)
        Dec 05 11:26:39 localhost systemd[1]: NetworkManager.service: Main process exited, code=exited, status=1/FAILURE

      The rhc tool was also affected by this problem:

      ERROR  rhsm  cannot connect to Red Hat Subscription Management: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
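
      Both errors point at the system D-Bus socket being unavailable when these services start. The following is a minimal diagnostic sketch (not part of the original report), assuming a standard RHEL 9 bootc host where dbus-broker provides the system bus:

        # Is the system bus socket that NetworkManager and rhc expect actually there?
        # (/var/run is a symlink to /run on RHEL 9, so both paths name the same socket)
        ls -l /run/dbus/system_bus_socket /var/run/dbus/system_bus_socket

        # State of the message bus and of NetworkManager; systemctl still works without
        # D-Bus because it can fall back to systemd's private socket
        systemctl status dbus.socket dbus-broker.service NetworkManager.service

        # Boot journal for both units, to catch the "cannot connect to D-Bus" errors
        journalctl -b -u dbus-broker.service -u NetworkManager.service --no-pager

        # nmcli and hostnamectl themselves talk over the system bus, so while the bug
        # is present they are expected to fail with a similar "could not connect" error
        nmcli general status
        hostnamectl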

      Expected behavior

      • The network configuration (IP address, gateway, DNS) and the hostname entered in the installer are applied on first boot; NetworkManager starts successfully and can reach the system D-Bus (a quick check is sketched below).
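
      As a rough acceptance check (a sketch, not part of the original report; the expected values depend on what was entered in the installer), all of the following should succeed on a correctly installed system:

        # System bus socket is present
        test -S /run/dbus/system_bus_socket && echo "system bus socket present"

        # Message bus and NetworkManager are running (dbus-broker is the default bus
        # implementation on RHEL 9)
        systemctl is-active dbus-broker.service NetworkManager.service

        # The device configured during installation reaches the "connected" state
        nmcli -t -f DEVICE,STATE device

        # The static hostname chosen during installation is applied
        hostnamectl --static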

      Screenshots

      • Attached Image 

      Device Info (please complete the following information):

      • Hardware Specs: Intel Gaudi 3 system, 8x HPU, 288 CPU cores, ~2.2 TB RAM (see ilab system info output below)
      • OS Version: Red Hat Enterprise Linux 9.4 (Plow), kernel 5.14.0-427.42.1.el9_4.x86_64
      • InstructLab Version: 0.21.0
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image

          [root@localhost ~]# bootc status --format json | jq .status.booted.image.image.image
          "registry.redhat.io/rhelai1/bootc-intel-rhel9:1.3-1733319681"

        • ilab system info, which prints detailed information about the InstructLab version, OS, and hardware, including GPU / AI accelerator hardware:

          [root@localhost ~]# ilab system info
          /usr/lib64/python3.11/inspect.py:389: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
            return isinstance(object, types.FunctionType)
          ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
           PT_HPU_LAZY_MODE = 1
           PT_RECIPE_CACHE_PATH = 
           PT_CACHE_FOLDER_DELETE = 0
           PT_HPU_RECIPE_CACHE_CONFIG = 
           PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
           PT_HPU_LAZY_ACC_PAR_MODE = 1
           PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
           PT_HPU_EAGER_PIPELINE_ENABLE = 1
           PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
          ---------------------------: System Configuration :---------------------------
          Num CPU Cores : 288
          CPU RAM       : -1919526024 KB
          ------------------------------------------------------------------------------
          Platform:
            sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.42.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: localhost
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 2265.40 GB
            memory.available: 2246.60 GB
            memory.used: 10.00 GB
          InstructLab:
            instructlab.version: 0.21.0
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.4.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.1
            instructlab-sdg.version: 0.6.1
            instructlab-training.version: 0.6.1
          Torch:
            torch.version: 2.4.0a0+git74cd574
            torch.backends.cpu.capability: AVX512
            torch.version.cuda: None
            torch.version.hip: None
            torch.cuda.available: False
            torch.backends.cuda.is_built: False
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            habana_torch_plugin.version: 1.18.0.524
            torch.hpu.is_available: True
            torch.hpu.device_count: 8
            torch.hpu.0.name: GAUDI3
            torch.hpu.0.capability: 1.18.0.1b7f293
            torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.1.name: GAUDI3
            torch.hpu.1.capability: 1.18.0.1b7f293
            torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.2.name: GAUDI3
            torch.hpu.2.capability: 1.18.0.1b7f293
            torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.3.name: GAUDI3
            torch.hpu.3.capability: 1.18.0.1b7f293
            torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.4.name: GAUDI3
            torch.hpu.4.capability: 1.18.0.1b7f293
            torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.5.name: GAUDI3
            torch.hpu.5.capability: 1.18.0.1b7f293
            torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.6.name: GAUDI3
            torch.hpu.6.capability: 1.18.0.1b7f293
            torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            torch.hpu.7.name: GAUDI3
            torch.hpu.7.capability: 1.18.0.1b7f293
            torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=18, device_id=0, device_type=5
            env.HABANA_LOGS: /var/log/habana_logs/
            env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
            env.HABANA_PROFILE: profile_api_light
            env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
          llama_cpp_python:
            llama_cpp_python.version: 0.2.79
            llama_cpp_python.supports_gpu_offload: False
          [root@localhost ~]# 

              Enrique Belarte Luque (ebelarte@redhat.com)
              Constantin Daniel Vultur (cvultur@redhat.com)