AI Platform Core Components / AIPCC-7548

Add facade NVIDIA Python packages to CUDA base image


    • Type: Feature
    • Resolution: Unresolved
    • Priority: Undefined
    • Accelerator Enablement

      Feature title: Add facade NVIDIA Python packages to CUDA base image

      Feature Overview:
      Upstream PyPI packages such as PyTorch depend on several NVIDIA Python wheels, for example nvidia-cuda-runtime-cu12. Downstream, we use system (RPM) packages for the CUDA stack, and neither our base images nor our downstream Python index provide these wheels. We should pre-populate the CUDA base image with facade packages that point to the system libraries and headers.

      Product(s) associated:

      RHAIIS: no
      RHEL AI: no
      RHOAI: yes

      Goals:

      • Our deliverables look more like upstream content.
      • Users can determine the CUDA version and the installed NVIDIA packages with pip list / uv pip list, just as with upstream content.
      • Users get better feedback when they try to mix upstream and downstream content: pip refuses to remove or update a facade package and reports that it was installed by a system package.

      Requirements:

      • The CUDA base image comes with the facade packages pre-installed.
      • Package names and versions are taken from the corresponding RPMs.
      • Every package has proper METADATA and INSTALLER files in its dist-info directory, but no RECORD file.
      • The lib and include subdirectories point to the correct system locations (nvshmem is different!), for example:
        • nvidia/cuda_runtime/lib -> /usr/local/cuda/lib64
        • nvidia/cuda_runtime/include -> /usr/local/cuda/include

      Since the packages must match the RPMs and contain symlinks to external resources, which the wheel format does not support, they could be created by a script instead of being built as wheels.
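      A minimal sketch of such a script, assuming POSIX sh. The helper name create_facade, its argument order and the INSTALLER text are hypothetical choices for illustration, not an agreed design. It writes a dist-info directory containing only METADATA and INSTALLER (deliberately no RECORD) and symlinks the module's lib and include directories to system paths. The version is passed in as an argument; in the real script it would presumably come from the RPM database, e.g. via rpm -q --queryformat '%{VERSION}' <rpm-name>, with the wheel-to-RPM name mapping still to be defined.

```shell
#!/bin/sh
# Hypothetical helper, not an agreed design:
#   create_facade SITE_PACKAGES DIST_NAME VERSION MODULE_DIR LIB_TARGET INCLUDE_TARGET
# Creates a facade "installed package" that pip can list but will not uninstall.
create_facade() {
    site=$1 dist=$2 version=$3 module=$4 lib_target=$5 inc_target=$6

    # dist-info directories use the normalized name: lowercase, runs of - _ . become _
    norm=$(printf '%s\n' "$dist" | tr '[:upper:]' '[:lower:]' | sed 's/[-_.][-_.]*/_/g')
    info="$site/$norm-$version.dist-info"
    mkdir -p "$info" "$site/$module" || return 1

    # Minimal core metadata: Metadata-Version, Name and Version are the required fields
    printf 'Metadata-Version: 2.1\nName: %s\nVersion: %s\n' "$dist" "$version" > "$info/METADATA"

    # pip quotes this text in its "installed by" hint when it refuses to uninstall
    echo "RHAI system package" > "$info/INSTALLER"

    # Deliberately no RECORD file, so pip cannot remove or upgrade the package.
    # lib/ and include/ become symlinks to the system (RPM-provided) locations.
    ln -sfn "$lib_target" "$site/$module/lib"
    ln -sfn "$inc_target" "$site/$module/include"
}
```

      Example call (the site-packages path is an assumption and depends on the image's Python): create_facade /usr/lib/python3.12/site-packages nvidia-cuda-runtime-cu12 12.8.90 nvidia/cuda_runtime /usr/local/cuda/lib64 /usr/local/cuda/include. nvshmem would be called with its own lib and include targets.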

      Done - Acceptance Criteria:
      All NVIDIA Python packages typically required by Torch and vLLM are present in the base image.

      Use Cases - i.e. User Experience & Workflow:
      A user who tries to replace an NVIDIA Python wheel gets a clear error message.

      Out of Scope:
      TBD

      Documentation Considerations:
      The presence of the facade packages needs to be documented, at least internally in the base image repository.

      Details

      NVIDIA packages required by Torch 2.9.0:

      nvidia-cublas-cu12==12.8.4.1
      nvidia-cuda-cupti-cu12==12.8.90
      nvidia-cuda-nvrtc-cu12==12.8.93
      nvidia-cuda-runtime-cu12==12.8.90
      nvidia-cudnn-cu12==9.10.2.21
      nvidia-cufft-cu12==11.3.3.83
      nvidia-cufile-cu12==1.13.1.3
      nvidia-curand-cu12==10.3.9.90
      nvidia-cusolver-cu12==11.7.3.90
      nvidia-cusparse-cu12==12.5.8.93
      nvidia-cusparselt-cu12==0.7.1
      nvidia-nccl-cu12==2.27.5
      nvidia-nvjitlink-cu12==12.8.93
      nvidia-nvshmem-cu12==3.3.20
      nvidia-nvtx-cu12==12.8.90
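      The list above can double as an acceptance check. The following sketch uses a hypothetical function check_pins and assumes a recent python3 (3.10 or newer) on PATH. It reads name==version lines on stdin and compares each pin with importlib.metadata.version(), which reads only the METADATA file and therefore also sees facade packages that have no RECORD. It prints a MISSING or MISMATCH line per problem and returns non-zero if any pin is not satisfied.

```shell
#!/bin/sh
# Hypothetical acceptance check: reads "name==version" lines on stdin and compares
# them with the distributions python3 can see (facade packages included).
check_pins() {
    status=0
    while read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue           # skip blank lines
        name=${line%%==*}                    # part before "=="
        want=${line#*==}                     # part after "=="
        # importlib.metadata only reads METADATA, so a missing RECORD does not matter;
        # an unknown name raises, prints nothing on stdout and leaves $have empty
        have=$(python3 -c 'import sys, importlib.metadata as m; print(m.version(sys.argv[1]))' \
            "$name" 2>/dev/null </dev/null)
        if [ -z "$have" ]; then
            echo "MISSING  $name (want $want)"
            status=1
        elif [ "$have" != "$want" ]; then
            echo "MISMATCH $name: have $have, want $want"
            status=1
        fi
    done
    return "$status"
}
```

      For example, save the pins above to a file (torch-2.9.0-nvidia.txt is a made-up name) inside the image and run check_pins < torch-2.9.0-nvidia.txt.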
      

      Example of pip refusing to update a package whose RECORD file is missing:

      $ rm nvidia_cuda_runtime_cu12-12.8.90.dist-info/RECORD
      $ echo "RHAI system package" > nvidia_cuda_runtime_cu12-12.8.90.dist-info/INSTALLER
      $ pip install -U nvidia_cuda_runtime_cu12
      ...
      ERROR: Cannot uninstall nvidia-cuda-runtime-cu12 12.8.90, RECORD file not found. Hint: The package was installed by RHAI system package.

      Assignee: Unassigned
      Reporter: Christian Heimes (cheimes@redhat.com)