  AI Platform Core Components / AIPCC-1500

RHEL AI - podman's fuse-overlayfs has heavy CPU usage and hangs the vLLM server startup

    • Known Issue
    • AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17
    • Approved

      Note: This bug is for the underlying performance issue behind AIPCC-1498.

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy RHEL AI v1.5 onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
      2. Follow our IBM-Cloud-specific commands to create a data disk
        1. Format with XFS
        2. Mount it at /mnt
        3. Set ILAB_HOME to /mnt
        4. Copy /etc/skel/.config/containers/storage.conf (the default one included in RHEL AI) to /mnt/.config/containers/storage.conf
      3. Prepare InstructLab (`ilab config init`, download models)
      4. Run `ilab data generate`
      5. While the aforementioned command slowly starts the vLLM server, run `top` and `ps -ef | grep fuse-overlayfs`.
      6. Run `ilab model train` (short)
      7. While the aforementioned command slowly starts the vLLM server (about 50 minutes in, and then again about 80 minutes in), run `top` and `ps -ef | grep fuse-overlayfs`.
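      Steps 5 and 7 above can be scripted rather than eyeballed in `top`; here is a minimal sketch (mine, not part of the report's tooling) that lists any running fuse-overlayfs processes with their CPU usage and elapsed time via `ps`:

```python
import subprocess

def find_fuse_overlayfs():
    """Return running fuse-overlayfs processes as dicts: pid, %CPU, elapsed seconds, command line."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,etimes,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    procs = []
    for line in out.splitlines()[1:]:   # skip the ps header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, pcpu, etimes, args = parts
        if "fuse-overlayfs" in args and "grep" not in args:
            procs.append({
                "pid": int(pid),
                "cpu_percent": float(pcpu),
                "elapsed_s": int(etimes),
                "cmd": args,
            })
    return procs

if __name__ == "__main__":
    for p in find_fuse_overlayfs():
        print(p)
```

      On an affected instance, a fuse-overlayfs entry pinned near 100% CPU for minutes at a time corresponds to the hangs described below.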

      Expected behavior

      • fuse-overlayfs never maxes out a CPU core for several minutes or longer.
      • It is a process like
        cloud-u+   42076       1 27 19:08 ?        00:14:33 /usr/bin/fuse-overlayfs -o lowerdir=/usr/lib/containers/storage/overlay/l/L4SRIYYHPYELNKLS6HUVIKLMER,upperdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/diff,workdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/work,volatile,context="system_u:object_r:container_file_t:s0:c1022,c1023",xattr_permissions=2 /mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/merged
      • Neither `ilab data generate` nor `ilab model train` is hung by it for a long time

      Actual behavior:

      1. fuse-overlayfs maxes out a CPU core for at least several minutes at the beginning of `ilab data generate`
      2. It also does this for at least a dozen minutes or so, about 50 minutes into `ilab model train`
      3. It then does it even longer, about 80 minutes into `ilab model train`. 

       

      Device Info (please complete the following information):

      • Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
      • OS Version: RHEL AI 1.5-7 Prod
      • InstructLab Version: 0.26.1
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
        • ilab system info
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: mdepaulo-v157-amd-prod-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.82 GB
        memory.available: 1729.31 GB
        memory.used: 28.04 GB
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.5 GB
        torch.cuda.0.total: 192.0 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.5 GB
        torch.cuda.1.total: 192.0 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.5 GB
        torch.cuda.2.total: 192.0 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.5 GB
        torch.cuda.3.total: 192.0 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.5 GB
        torch.cuda.4.total: 192.0 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.5 GB
        torch.cuda.5.total: 192.0 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.5 GB
        torch.cuda.6.total: 192.0 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.5 GB
        torch.cuda.7.total: 192.0 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False
       
        • The containers' storage.conf file:

      [storage]
      driver = "overlay"
      [storage.options]
      size = ""
      remap-uids = ""
      remap-gids = ""
      ignore_chown_errors = ""
      remap-user = ""
      remap-group = ""
      skip_mount_home = ""
      mount_program = "/usr/bin/fuse-overlayfs"
      mountopt = ""
      additionalimagestores = [ "/usr/lib/containers/storage",]
      [storage.options.overlay]
      force_mask = "shared"
       

       

       

      Bug impact

      • Training times out and cannot be run without hugely raising the ilab `max_startup_attempts` setting

      Known workaround

      • None for this underlying performance issue.
      • InstructLab can accommodate it by raising the 2 instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200, per AIPCC-1498
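      The workaround above amounts to a text substitution in config.yaml; a minimal sketch (the regex approach and the sample fragment are illustrative, not taken from the report) that bumps every `max_startup_attempts` value to 1200:

```python
import re

def raise_max_startup_attempts(config_text: str, new_value: int = 1200) -> str:
    """Replace the value on every `max_startup_attempts:` line of the YAML text."""
    return re.sub(
        r"(max_startup_attempts:\s*)\d+",
        lambda m: m.group(1) + str(new_value),
        config_text,
    )

# Illustrative fragment; the real config.yaml has 2 such lines per the report.
sample = "  max_startup_attempts: 120\n"
print(raise_max_startup_attempts(sample))  # prints "  max_startup_attempts: 1200"
```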

       

        1. ilab-model-train.txt
          366 kB
        2. ilab-data-generate
          65 kB
        3. ilab-config-show
          20 kB

              mdepaulo@redhat.com Mike DePaulo
              Antonio's Team