Red Hat Enterprise Linux AI / RHELAI-3247

ilab command does not work as non-root user


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Affects Version/s: rhelai-1.4
    • Fix Version/s: rhelai-1.4
    • Component/s: Containers

      I originally hit this with an AMD instance, but I also tried `registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4-1738329879` and hit the same problem, so this does not appear to be an AMD-specific issue even though the commit I reference is from bootc-amd.

      I also want to emphasize that while ilab commands can still be run as root, that may not be a solution longer term due to the default partitioning layout: the default partition scheme on bare metal leaves barely enough disk space to `bootc switch`, much less download models, run SDG, or import data for training.

    • Severity: Critical
    • Proposed

      To Reproduce

      Steps to reproduce the behavior:

      1. Log into a RHEL AI instance containing this change as a non-root user
      2. Run `ilab`
      $ ilab system info
      Trying to pull registry.stage.redhat.io/rhelai1/instructlab-amd-rhel9:1.4-1738264488...
      Error: initializing source docker://registry.stage.redhat.io/rhelai1/instructlab-amd-rhel9:1.4-1738264488: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication
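
      The auth error suggests the pull is running rootless without registry credentials: ilab delegates container pulls to podman, and rootless podman reads credentials from a per-user auth file rather than root's. A minimal diagnostic sketch (paths assume podman's default configuration, with no `REGISTRY_AUTH_FILE` override):

```shell
# Rootless podman looks for registry credentials in the user's runtime
# directory (default podman behavior).
user_auth="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/containers/auth.json"
echo "user auth file: ${user_auth}"

# Root's credentials (used when ilab is run via sudo) live separately,
# so a login done as root does not help the non-root user.
echo "root auth file: /run/containers/0/auth.json"

# Logging in as the non-root user should populate the user-level file:
#   podman login registry.stage.redhat.io
```

      If the user-level auth file is missing while root's exists, that would explain why the same command succeeds under sudo.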

      Expected behavior

      • ilab command should work as non-root user

      Device Info:

      • Hardware Specs: x86_64, MI300X bare metal
      • OS Version: RHEL AI 1.4
      • InstructLab Version: 0.23.1
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image
          • "registry.stage.redhat.io/rhelai1/bootc-amd-rhel9:1.4-1738329869"
        • ilab system info


      # ilab system info
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.50.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: GPUF333
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 3023.54 GB
        memory.available: 2983.71 GB
        memory.used: 30.85 GB
      InstructLab:
        instructlab.version: 0.23.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.7.0
        instructlab-training.version: 0.7.0
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.2.41134-65d174c3e
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.4 GB
        torch.cuda.0.total: 192.0 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.4 GB
        torch.cuda.1.total: 192.0 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.4 GB
        torch.cuda.2.total: 192.0 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.4 GB
        torch.cuda.3.total: 192.0 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.4 GB
        torch.cuda.4.total: 192.0 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.4 GB
        torch.cuda.5.total: 192.0 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.4 GB
        torch.cuda.6.total: 192.0 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.4 GB
        torch.cuda.7.total: 192.0 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.3.2
        llama_cpp_python.supports_gpu_offload: False
      


      Bug impact

      • users will not be able to use ilab on affected RHEL AI images

      Known workaround

      • running all ilab commands as root can work, but will almost certainly hit disk-space issues due to the default partitioning scheme
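
      To quantify the disk-space constraint behind the workaround's caveat, a quick check on the affected host could look like this (a sketch; the model-cache path assumes InstructLab's default layout under the user's home directory):

```shell
# Free space on the root filesystem (the default bare-metal partition
# scheme puts almost everything here).
df -h /

# Size of the downloaded-model cache, if present (default instructlab
# cache location; an assumption, adjust for your config).
du -sh ~/.cache/instructlab/models 2>/dev/null || echo "no model cache yet"
```

      Granite-class models run to tens of gigabytes each, so even a nominally healthy root filesystem can fill up after one or two downloads plus SDG output.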

              Assignee: Fabien Dupont (fdupont@redhat.com)
              Reporter: Tim Flink (tflink)
              Votes: 0
              Watchers: 1