AI Platform Core Components / AIPCC-1500

RHEL AI - podman's fuse-overlayfs has heavy CPU usage and hangs the vLLM server startup

    • Known Issue
    • AP Sprint 10, AP Sprint 11, AP Sprint 12, AP Sprint 13, AP Sprint 14, AP Sprint 15, AP Sprint 16, AP Sprint 17, AP Sprint 18, AP Sprint 19, DP Sprint 20
    • Approved

      Note: This bug is for the underlying performance issue behind AIPCC-1498.

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy RHEL AI v1.5 onto IBM Cloud (requires a data disk) or any other cloud instance with a data disk
      2. Follow our IBM-Cloud-specific commands to create a data disk
        1. Format with XFS
        2. mount it at /mnt
        3. Set ILAB_HOME to /mnt
        4. Copy /etc/skel/.config/containers/storage.conf (the default one included in RHEL AI) to /mnt/.config/containers/storage.conf
      3. Prepare InstructLab (`ilab config init`, download models)
      4. Run ilab data generate
      5. While the aforementioned command slowly starts the vLLM server, run `top` and `ps -ef | grep fuse-overlayfs`.
      6. Run ilab model train (short)
      7. While the aforementioned command slowly starts the vLLM server (about 50 minutes in, and then about 80 minutes in), run `top` and `ps -ef | grep fuse-overlayfs`.
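      The data-disk preparation in step 2 can be sketched as a shell function. This is a sketch, not a supported script: the block device path is an argument supplied by the caller, since it varies per cloud instance.

```shell
# Sketch of step 2: prepare a data disk for RHEL AI.
# The block device path (e.g. /dev/vdb) is an assumption and must be
# supplied by the caller.
setup_data_disk() {
    dev="${1:?usage: setup_data_disk /dev/DEVICE}"
    sudo mkfs.xfs "$dev"                          # 2.1: format with XFS
    sudo mount "$dev" /mnt                        # 2.2: mount it at /mnt
    export ILAB_HOME=/mnt                         # 2.3: set ILAB_HOME to /mnt
    mkdir -p /mnt/.config/containers
    cp /etc/skel/.config/containers/storage.conf \
       /mnt/.config/containers/storage.conf       # 2.4: copy the default storage.conf
}
```

      Invoked as, for example, `setup_data_disk /dev/vdb` (the device name here is illustrative).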

      Expected behavior

      • fuse-overlayfs never maxes out a CPU core for several minutes at a time, let alone longer.
      • It runs as a process like:
        cloud-u+   42076       1 27 19:08 ?        00:14:33 /usr/bin/fuse-overlayfs -o lowerdir=/usr/lib/containers/storage/overlay/l/L4SRIYYHPYELNKLS6HUVIKLMER,upperdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/diff,workdir=/mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/work,volatile,context="system_u:object_r:container_file_t:s0:c1022,c1023",xattr_permissions=2 /mnt/.local/share/containers/storage/overlay/a51998ca987d2012b6dca0a727120ae2ebce10abe25d0dc81d19dcaf92db0c19/merged
      • Neither `ilab data generate` nor `ilab model train` is hung by it for long periods.
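      To catch the condition above, the fuse-overlayfs processes can be watched with standard procps options (a sketch; `watch_fuse` is a hypothetical helper name):

```shell
# Show CPU usage and elapsed time of any running fuse-overlayfs
# processes (procps ps; -C selects processes by command name).
watch_fuse() {
    ps -o pid,pcpu,etime,args -C fuse-overlayfs
}
```

      Run it in a second terminal while `ilab data generate` or `ilab model train` is starting the vLLM server; a pcpu value pinned near 100 for minutes reproduces the bug.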

      Actual behavior:

      1. fuse-overlayfs maxes out a CPU core for at least several minutes at the beginning of `ilab data generate`.
      2. It does so again, for at least a dozen minutes, about 50 minutes into `ilab model train`.
      3. It does so for even longer about 80 minutes into `ilab model train`.

       

      Device Info (please complete the following information):

      • Hardware Specs: gx3d-208x1792x8mi300x (8*MI300X)
      • OS Version: RHEL AI 1.5-7 Prod
      • InstructLab Version: 0.26.1
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image : "registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5"
        • ilab system info
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: mdepaulo-v157-amd-prod-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.82 GB
        memory.available: 1729.31 GB
        memory.used: 28.04 GB
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.5 GB
        torch.cuda.0.total: 192.0 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.5 GB
        torch.cuda.1.total: 192.0 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.5 GB
        torch.cuda.2.total: 192.0 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.5 GB
        torch.cuda.3.total: 192.0 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.5 GB
        torch.cuda.4.total: 192.0 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.5 GB
        torch.cuda.5.total: 192.0 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.5 GB
        torch.cuda.6.total: 192.0 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.5 GB
        torch.cuda.7.total: 192.0 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False
       
        • The containers' storage.conf file:

      [storage]
      driver = "overlay"
      [storage.options]
      size = ""
      remap-uids = ""
      remap-gids = ""
      ignore_chown_errors = ""
      remap-user = ""
      remap-group = ""
      skip_mount_home = ""
      mount_program = "/usr/bin/fuse-overlayfs"
      mountopt = ""
      additionalimagestores = [ "/usr/lib/containers/storage",]
      [storage.options.overlay]
      force_mask = "shared"
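      Given the configuration above, a quick way to confirm which driver and mount program a storage.conf selects (a sketch; `show_storage_driver` is a hypothetical helper, and the pattern only matches the two keys of interest):

```shell
# Print the storage driver and mount_program entries from a
# containers storage.conf file passed as the first argument.
show_storage_driver() {
    grep -E '^[[:space:]]*(driver|mount_program)[[:space:]]*=' "$1"
}
```

      For the file above, `show_storage_driver /mnt/.config/containers/storage.conf` prints the overlay driver and the `/usr/bin/fuse-overlayfs` mount program. Alternatively, `podman info` reports the graph driver actually in use.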

      Bug impact

      • Training times out and cannot be run without hugely raising ilab's `max_startup_attempts`.

      Known workaround

      • None for this underlying performance issue.
      • InstructLab can accommodate it by raising the 2 instances of `max_startup_attempts` in the ilab config.yaml from 120 to 1200, per AIPCC-1498.
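      The accommodation from AIPCC-1498 can be scripted; a sketch that bumps every `max_startup_attempts: 120` in a given config.yaml to 1200 (the config path is passed in, since its location depends on ILAB_HOME, and `raise_startup_attempts` is a hypothetical helper name):

```shell
# Raise max_startup_attempts from 120 to 1200 in an ilab config.yaml.
# Edits every occurrence in place; the file has two instances of the key.
# The trailing $ anchor keeps the edit idempotent (1200 is never re-matched).
raise_startup_attempts() {
    sed -i 's/max_startup_attempts: 120$/max_startup_attempts: 1200/' "$1"
}
```

      For example, `raise_startup_attempts /mnt/.config/instructlab/config.yaml` (the path shown is an assumption based on ILAB_HOME=/mnt).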

       

        1. ilab-config-show
          20 kB
          Mike DePaulo
        2. ilab-data-generate
          65 kB
          Mike DePaulo
        3. ilab-model-train.txt
          366 kB
          Mike DePaulo

              mdepaulo@redhat.com Mike DePaulo
              Antonio's Team
              Votes: 0
              Watchers: 4