Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version: rhelai-1.5
Release Note Type: Known Issue
Sprints: AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17
Release Note Status: Approved
Note: This bug is for the underlying performance issue behind AIPCC-1498.
To Reproduce
Steps to reproduce the behavior:
- Deploy RHEL AI v1.5 onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
- Follow our IBM-Cloud-specific commands to create a data disk
- Format it with XFS
- Mount it at /mnt
- Set ILAB_HOME to /mnt
- Copy /etc/skel/.config/containers/storage.conf (the default one included in RHEL AI) to /mnt/.config/containers/storage.conf
- Prepare InstructLab (ilab config init, download models); a consolidated sketch of these setup steps follows this list
- Run ilab data generate
- While the aforementioned command slowly starts the vLLM server, run `top` and `ps -ef | grep fuse-overlayfs`.
- Run ilab model train (short)
- While the aforementioned command slowly starts the vLLM server (about 50 minutes in, and then again about 80 minutes in), run `top` and `ps -ef | grep fuse-overlayfs`.
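For convenience, the setup steps above condense to roughly the following. This is a minimal sketch, not the exact IBM-Cloud-specific procedure: the device name /dev/vdb and the model-download step are assumptions, so substitute your instance's actual values.

```bash
# Sketch of the reproduction setup. /dev/vdb is an assumed device name.
sudo mkfs.xfs /dev/vdb
sudo mount /dev/vdb /mnt
export ILAB_HOME=/mnt

# Use the default storage.conf shipped with RHEL AI.
mkdir -p /mnt/.config/containers
cp /etc/skel/.config/containers/storage.conf /mnt/.config/containers/storage.conf

# Prepare InstructLab and trigger the issue.
ilab config init
ilab model download   # download whichever models your config requires
ilab data generate
```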
Expected behavior
- fuse-overlayfs never maxes out a CPU core for several minutes, let alone longer.
- It runs as a process like:
cloud-u+ 42076 1 27 19:08 ? 00:14:33 /usr/bin/fuse-overlayfs -o lowerdir=/usr/lib/containers/storage/overlay/l/L4SRIYYHPYELNKLS6HUVIKLMER,upperdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/diff,workdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/work,volatile,context="system_u:object_r:container_file_t:s0:c1022,c1023",xattr_permissions=2 /mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/merged
- Neither `ilab data generate` nor `ilab model train` is blocked by it for long periods
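Beyond `top` and `ps`, one way to watch for the misbehavior is to sample fuse-overlayfs CPU usage directly. A sketch, assuming pidstat (from the sysstat package) is available on the image:

```bash
# Sample CPU usage of all fuse-overlayfs processes every 5 seconds.
# Assumes pidstat (sysstat) is installed; otherwise fall back to `top`.
pidstat -u -p "$(pgrep -d, -x fuse-overlayfs)" 5
```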
Actual behavior:
- fuse-overlayfs maxes out a CPU core for several minutes or more at the beginning of `ilab data generate`
- It does so again for a dozen minutes or more, about 50 minutes into `ilab model train`
- It then does so for even longer, about 80 minutes into `ilab model train`.
Device Info:
- Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
- OS Version: RHEL AI 1.5-7 Prod
- InstructLab Version: 0.26.1
- The output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
- ilab system info:
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: mdepaulo-v157-amd-prod-2
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1763.82 GB
  memory.available: 1729.31 GB
  memory.used: 28.04 GB

InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.2

Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: 6.3.42134-a9a80e791
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: AMD Radeon Graphics
  torch.cuda.0.free: 191.5 GB
  torch.cuda.0.total: 192.0 GB
  torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: AMD Radeon Graphics
  torch.cuda.1.free: 191.5 GB
  torch.cuda.1.total: 192.0 GB
  torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: AMD Radeon Graphics
  torch.cuda.2.free: 191.5 GB
  torch.cuda.2.total: 192.0 GB
  torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: AMD Radeon Graphics
  torch.cuda.3.free: 191.5 GB
  torch.cuda.3.total: 192.0 GB
  torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: AMD Radeon Graphics
  torch.cuda.4.free: 191.5 GB
  torch.cuda.4.total: 192.0 GB
  torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: AMD Radeon Graphics
  torch.cuda.5.free: 191.5 GB
  torch.cuda.5.total: 192.0 GB
  torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: AMD Radeon Graphics
  torch.cuda.6.free: 191.5 GB
  torch.cuda.6.total: 192.0 GB
  torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: AMD Radeon Graphics
  torch.cuda.7.free: 191.5 GB
  torch.cuda.7.total: 192.0 GB
  torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False
- The containers' storage.conf file:
[storage]
driver = "overlay"

[storage.options]
size = ""
remap-uids = ""
remap-gids = ""
ignore_chown_errors = ""
remap-user = ""
remap-group = ""
skip_mount_home = ""
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = ""
additionalimagestores = [ "/usr/lib/containers/storage",]

[storage.options.overlay]
force_mask = "shared"
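As a quick sanity check that this configuration is in effect, podman can report which graph driver and mount program it is actually using. A sketch (the exact JSON key layout can vary by podman version):

```bash
# Confirm the storage driver and that fuse-overlayfs is the mount program.
podman info --format '{{.Store.GraphDriverName}}'
podman info --format json | jq '.store.graphOptions'
```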
Bug impact
- Training times out and cannot be run without hugely raising the ilab `max_startup_attempts`
Known workaround
- None for this underlying performance issue.
- InstructLab can accommodate it by raising the two instances of `max_startup_attempts` in ilab's config.yaml from 120 to 1200, per AIPCC-1498 (see the sketch below)
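A sketch of applying that accommodation; the config path below is an assumption based on ILAB_HOME=/mnt and the default InstructLab layout, and the exact YAML spacing may differ in your file:

```bash
# Raise both max_startup_attempts values from 120 to 1200.
# Path assumes ILAB_HOME=/mnt; adjust to your actual config location.
sed -i 's/max_startup_attempts: 120/max_startup_attempts: 1200/g' \
    /mnt/.config/instructlab/config.yaml
```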
Issue links:
- causes: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk (status: Review)
- clones: AIPCC-1498 RHEL AI 1.5 - vLLM fails to start during training when using a separate data disk (status: Review)