Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version: rhelai-1.5
Release Note Type: Known Issue
Sprints: AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17
Release Note Status: Approved
Note: This bug is for the underlying performance issue behind AIPCC-1498.
To Reproduce
Steps to reproduce the behavior:
- Deploy RHEL AI v1.5 onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
- Follow our IBM-Cloud-specific commands to create a data disk
- Format it with XFS
- Mount it at /mnt
- Set ILAB_HOME to /mnt
- Copy /etc/skel/.config/containers/storage.conf (the default one included in RHEL AI) to /mnt/.config/containers/storage.conf
- Prepare InstructLab (ilab config init, download models); a consolidated sketch of these setup steps follows this list
- Run ilab data generate
- While the aforementioned command slowly starts the vLLM server, run `top` and `ps -ef | grep fuse-overlayfs`.
- Run ilab model train (short)
- While the aforementioned command slowly starts the vLLM server (about 50 minutes in, and then again about 80 minutes in), run `top` and `ps -ef | grep fuse-overlayfs`.
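For convenience, the setup steps above condense to roughly the following. This is a minimal sketch, not the exact IBM-Cloud-specific procedure: the device name /dev/vdb and the model-download step are assumptions, so substitute your instance's actual values.

```bash
# Sketch of the reproduction setup. /dev/vdb is an assumed device name.
sudo mkfs.xfs /dev/vdb
sudo mount /dev/vdb /mnt
export ILAB_HOME=/mnt

# Use the default storage.conf shipped with RHEL AI.
mkdir -p /mnt/.config/containers
cp /etc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf

# Prepare InstructLab and trigger the issue.
ilab config init
ilab model download   # download whichever models your config requires
ilab data generate
```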
Expected behavior
- fuse-overlayfs never maxes out a CPU core for several minutes, let alone longer.
- It runs as a process like:
cloud-u+ 42076 1 27 19:08 ? 00:14:33 /usr/bin/fuse-overlayfs -o lowerdir=/usr/lib/containers/storage/overlay/l/L4SRIYYHPYELNKLS6HUVIKLMER,upperdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/diff,workdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/work,volatile,context="system_u:object_r:container_file_t:s0:c1022,c1023",xattr_permissions=2 /mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/merged
- Neither `ilab data generate` nor `ilab model train` is blocked by it for long periods
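Beyond `top` and `ps`, one way to watch for the misbehavior is to sample fuse-overlayfs CPU usage directly. A sketch, assuming pidstat (from the sysstat package) is available on the image:

```bash
# Sample CPU usage of all fuse-overlayfs processes every 5 seconds.
# Assumes pidstat (sysstat) is installed; otherwise fall back to `top`.
pidstat -u -p "$(pgrep -d, -x fuse-overlayfs)" 5
```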
Actual behavior:
- fuse-overlayfs maxes out a CPU core for several minutes or more at the beginning of `ilab data generate`
- It does so again for a dozen minutes or more, about 50 minutes into `ilab model train`
- It then does so for even longer, about 80 minutes into `ilab model train`.
Device Info:
- Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
- OS Version: RHEL AI 1.5-7 Prod
- InstructLab Version: 0.26.1
- The output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
- ilab system info:
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: mdepaulo-v157-amd-prod-2
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1763.82 GB
  memory.available: 1729.31 GB
  memory.used: 28.04 GB

InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.2

Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: 6.3.42134-a9a80e791
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: AMD Radeon Graphics
  torch.cuda.0.free: 191.5 GB
  torch.cuda.0.total: 192.0 GB
  torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: AMD Radeon Graphics
  torch.cuda.1.free: 191.5 GB
  torch.cuda.1.total: 192.0 GB
  torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: AMD Radeon Graphics
  torch.cuda.2.free: 191.5 GB
  torch.cuda.2.total: 192.0 GB
  torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: AMD Radeon Graphics
  torch.cuda.3.free: 191.5 GB
  torch.cuda.3.total: 192.0 GB
  torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: AMD Radeon Graphics
  torch.cuda.4.free: 191.5 GB
  torch.cuda.4.total: 192.0 GB
  torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: AMD Radeon Graphics
  torch.cuda.5.free: 191.5 GB
  torch.cuda.5.total: 192.0 GB
  torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: AMD Radeon Graphics
  torch.cuda.6.free: 191.5 GB
  torch.cuda.6.total: 192.0 GB
  torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: AMD Radeon Graphics
  torch.cuda.7.free: 191.5 GB
  torch.cuda.7.total: 192.0 GB
  torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False
- The containers' storage.conf file:
[storage]
driver = "overlay"

[storage.options]
size = ""
remap-uids = ""
remap-gids = ""
ignore_chown_errors = ""
remap-user = ""
remap-group = ""
skip_mount_home = ""
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = ""
additionalimagestores = [ "/usr/lib/containers/storage",]

[storage.options.overlay]
force_mask = "shared"
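As a quick sanity check that this configuration is in effect, podman can report which graph driver and mount program it is actually using. A sketch (the exact JSON key layout can vary by podman version):

```bash
# Confirm the storage driver and that fuse-overlayfs is the mount program.
podman info --format '{{.Store.GraphDriverName}}'
podman info --format json | jq '.store.graphOptions'
```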
Bug impact
- Training times out and cannot be run without hugely raising the ilab `max_startup_attempts`
Known workaround
- None for this underlying performance issue.
- InstructLab can accommodate it by raising the two instances of `max_startup_attempts` in ilab's config.yaml from 120 to 1200, per AIPCC-1498 (see the sketch below)
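A sketch of applying that accommodation; the config path below is an assumption based on ILAB_HOME=/mnt and the default InstructLab layout, and the exact YAML spacing may differ in your file:

```bash
# Raise both max_startup_attempts values from 120 to 1200.
# Path assumes ILAB_HOME=/mnt; adjust to your actual config location.
sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/g' \
    /mnt/.config/instructlab/config.yaml
```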
Issue links:
- causes: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk (status: Review)
- clones: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk (status: Review)