- Bug
- Resolution: Unresolved
- Critical
- rhelai-1.5
- False
- False
- Known Issue
- AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17, AP Sprint 18, AP Sprint 19, DP Sprint 20
- Approved
Note: This bug is for the underlying performance issue behind AIPCC-1498.
To Reproduce
Steps to reproduce the behavior (a consolidated command sketch follows this list):
- Deploy RHEL AI v1.5 onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
- Follow our IBM-Cloud-specific commands to create a data disk
- Format it with XFS
- Mount it at /mnt
- Set ILAB_HOME to /mnt
- Copy /etc/skel/.config/containers/storage.conf (the default one included in RHEL AI) to /mnt/.config/containers/storage.conf
- Prepare InstructLab (ilab config init, download models)
- Run ilab data generate
- While the aforementioned command slowly starts the vLLM server, run `top` and `ps -ef | grep fuse-overlayfs`
- Run ilab model train (short)
- While the aforementioned command slowly starts the vLLM server (about 50 minutes in, and again about 80 minutes in), run `top` and `ps -ef | grep fuse-overlayfs`
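For convenience, here is a rough shell sketch of the setup steps above. The device name /dev/vdb, the ownership change, and the exact `ilab model download` invocation are illustrative assumptions, not taken from this report; substitute the values for your instance.

    # Assumed data-disk device on the cloud instance; adjust to match yours.
    DISK=/dev/vdb

    # Format the data disk with XFS and mount it at /mnt.
    sudo mkfs.xfs "$DISK"
    sudo mount "$DISK" /mnt
    sudo chown "$USER": /mnt   # let the unprivileged user own the mount point

    # Point InstructLab at the data disk.
    export ILAB_HOME=/mnt

    # Copy the default RHEL AI containers storage.conf onto the data disk.
    mkdir -p /mnt/.config/containers
    cp /etc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf

    # Prepare InstructLab and run the two workloads that exhibit the issue,
    # watching `top` and `ps -ef | grep fuse-overlayfs` in a second terminal.
    ilab config init
    ilab model download      # repeat as needed for the required models
    ilab data generate
    ilab model train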
Expected behavior
- fuse-overlayfs does not max out a CPU core for several minutes at a time (or even longer)
- It runs as a process like:
cloud-u+ 42076 1 27 19:08 ? 00:14:33 /usr/bin/fuse-overlayfs -o lowerdir=/usr/lib/containers/storage/overlay/l/L4SRIYYHPYELNKLS6HUVIKLMER,upperdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/diff,workdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/work,volatile,context="system_u:object_r:container_file_t:s0:c1022,c1023",xattr_permissions=2 /mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/merged
- Neither `ilab data generate` nor `ilab model train` (short) is hung by it for an extended period
Actual behavior
- fuse-overlayfs maxes out a CPU core for at least several minutes at the beginning of `ilab data generate`
- It also does this for at least a dozen minutes, starting about 50 minutes into `ilab model train`
- It then does so for even longer, about 80 minutes into `ilab model train` (one way to record these spikes is sketched below)
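The following loop is only an illustration of how to capture the spikes described above; it is not part of the original report, and the 30-second interval and output fields are arbitrary choices.

    # Sample fuse-overlayfs CPU usage every 30 seconds while ilab runs in
    # another terminal; the bracket trick keeps grep from matching itself.
    while true; do
        date
        ps -eo pid,pcpu,etime,args | grep '[f]use-overlayfs'
        sleep 30
    done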
Device Info:
- Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
- OS Version: RHEL AI 1.5-7 Prod
- InstructLab Version: 0.26.1
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
- ilab system info:
  Platform:
    sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
    sys.platform: linux
    os.name: posix
    platform.release: 5.14.0-427.65.1.el9_4.x86_64
    platform.machine: x86_64
    platform.node: mdepaulo-v157-amd-prod-2
    platform.python_version: 3.11.7
    os-release.ID: rhel
    os-release.VERSION_ID: 9.4
    os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
    memory.total: 1763.82 GB
    memory.available: 1729.31 GB
    memory.used: 28.04 GB
  InstructLab:
    instructlab.version: 0.26.1
    instructlab-dolomite.version: 0.2.0
    instructlab-eval.version: 0.5.1
    instructlab-quantize.version: 0.1.0
    instructlab-schema.version: 0.4.2
    instructlab-sdg.version: 0.8.2
    instructlab-training.version: 0.10.2
  Torch:
    torch.version: 2.6.0
    torch.backends.cpu.capability: AVX512
    torch.version.cuda: None
    torch.version.hip: 6.3.42134-a9a80e791
    torch.cuda.available: True
    torch.backends.cuda.is_built: True
    torch.backends.mps.is_built: False
    torch.backends.mps.is_available: False
    torch.cuda.bf16: True
    torch.cuda.current.device: 0
    torch.cuda.0.name: AMD Radeon Graphics
    torch.cuda.0.free: 191.5 GB
    torch.cuda.0.total: 192.0 GB
    torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.1.name: AMD Radeon Graphics
    torch.cuda.1.free: 191.5 GB
    torch.cuda.1.total: 192.0 GB
    torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.2.name: AMD Radeon Graphics
    torch.cuda.2.free: 191.5 GB
    torch.cuda.2.total: 192.0 GB
    torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.3.name: AMD Radeon Graphics
    torch.cuda.3.free: 191.5 GB
    torch.cuda.3.total: 192.0 GB
    torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.4.name: AMD Radeon Graphics
    torch.cuda.4.free: 191.5 GB
    torch.cuda.4.total: 192.0 GB
    torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.5.name: AMD Radeon Graphics
    torch.cuda.5.free: 191.5 GB
    torch.cuda.5.total: 192.0 GB
    torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.6.name: AMD Radeon Graphics
    torch.cuda.6.free: 191.5 GB
    torch.cuda.6.total: 192.0 GB
    torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.7.name: AMD Radeon Graphics
    torch.cuda.7.free: 191.5 GB
    torch.cuda.7.total: 192.0 GB
    torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  llama_cpp_python:
    llama_cpp_python.version: 0.3.6
    llama_cpp_python.supports_gpu_offload: False
- The containers' storage.conf file:
  [storage]
  driver = "overlay"

  [storage.options]
  size = ""
  remap-uids = ""
  remap-gids = ""
  ignore_chown_errors = ""
  remap-user = ""
  remap-group = ""
  skip_mount_home = ""
  mount_program = "/usr/bin/fuse-overlayfs"
  mountopt = ""
  additionalimagestores = [ "/usr/lib/containers/storage",]

  [storage.options.overlay]
  force_mask = "shared"
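For context, these commands (a suggestion, not part of the report; output layout varies by podman version) confirm that rootless podman on the data disk really is using fuse-overlayfs as its mount program:

    # Show the active storage driver and its options for the current user.
    podman info | grep -iA3 graphOptions

    # The per-user storage.conf on the data disk should name fuse-overlayfs.
    grep mount_program /mnt/.config/containers/storage.conf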
Bug impact
- Training times out and cannot be run without greatly raising the ilab `max_startup_attempts`
Known workaround
- None for this underlying performance issue.
- InstructLab can accommodate it by raising the two instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200, per AIPCC-1498 (see the sketch below)
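A minimal sketch of that workaround, assuming the active config lives at $ILAB_HOME/.config/instructlab/config.yaml and that both occurrences are spelled exactly `max_startup_attempts: 120`; verify the path and values in your own config before editing:

    # Back up the config, then raise both max_startup_attempts values 120 -> 1200.
    CONFIG="$ILAB_HOME/.config/instructlab/config.yaml"
    cp "$CONFIG" "$CONFIG.bak"
    sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/g' "$CONFIG"

    # Confirm both occurrences were updated.
    grep -n max_startup_attempts "$CONFIG"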
- causes: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start on during training when using a separate data disk (Review)
- clones: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start on during training when using a separate data disk (Review)