AI Platform Core Components / AIPCC-5943

Support multiple versions of accelerator toolkits


      Feature title: Support multiple versions of accelerator toolkits

      Feature Overview:

      This initiative introduces a new strategy to support multiple versions of accelerator toolkits (like CUDA) concurrently within the AI Platform Core Components. Currently, our container images use generic tags (e.g., cuda-ubi9), which prevents us from hosting different toolkit versions simultaneously. This limitation is blocking the RHAIIS team, which requires both CUDA 12.9.1 and 12.8.1 for upcoming releases. By transitioning to version-specific image names (e.g., cuda-12.8-ubi9), we will enable teams to use the optimal accelerator version for their specific workloads, improving performance and compatibility without resorting to custom, unsustainable build pipelines.

      Product(s) associated:

      RHAIIS: yes
      RHEL AI: yes
      RHOAI: yes

      Goals:

      This feature enables development teams to build and deploy applications using multiple, concurrent versions of accelerator toolkits. The RHAIIS team will be able to build their llm-d and vllm images with the two CUDA versions required for their upcoming releases. This change moves us from a state where all workloads are forced to use a single accelerator version to a more flexible and sustainable model in which our internal build processes can produce version-specific images on demand, reducing dependency on upstream builds.

      Requirements:

      • A new naming convention for Application Base Images must be implemented where the accelerator version is in the image name, not the tag (e.g., quay.io/aipcc/base-images/cuda-12.8-el9.6:3.0-175…).
      • A new naming convention for Builder Images must be implemented with the format cuda-{major}.{minor}-{base-os} (e.g., cuda-12.8-ubi9).
      • Build pipelines must be parameterized to accept CUDA versions as input and modified to build multiple versions concurrently.
      • To maintain backward compatibility with existing plugins, the FROMAGER_VARIANT environment variable inside each new image must be set to the generic name (e.g., cuda-ubi9 for the cuda-12.9-ubi9 image).
      • A process must be established for teams to request a new Quay repository via a Jira EPIC when a new accelerator version is needed.
      • The Accelerator Enablement Team is responsible for implementing build pipeline changes and creating new images and variants.
      • The Development Platform and Productization teams will support the Accelerator Enablement team if changes are needed in common infrastructure.
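
The naming requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `builder_image_name` and `generic_variant` are hypothetical and not part of any existing tooling, and the regular expression simply encodes the cuda-{major}.{minor}-{base-os} format quoted in the requirements.

```python
import re

# Encodes the format from the requirements: {accelerator}-{major}.{minor}-{base-os}.
# The pattern itself is an assumption made for this sketch.
_VERSIONED = re.compile(
    r"^(?P<accel>[a-z]+)-(?P<major>\d+)\.(?P<minor>\d+)-(?P<base_os>[a-z0-9.]+)$"
)


def builder_image_name(toolkit_version: str, base_os: str, accelerator: str = "cuda") -> str:
    """Build a version-specific image name from a full toolkit version.

    Only major and minor are kept, so "12.8.1" with "ubi9" gives "cuda-12.8-ubi9".
    """
    parts = toolkit_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected toolkit version: {toolkit_version!r}")
    return f"{accelerator}-{parts[0]}.{parts[1]}-{base_os}"


def generic_variant(image_name: str) -> str:
    """Return the generic name that FROMAGER_VARIANT would carry for a versioned image."""
    match = _VERSIONED.match(image_name)
    if match is None:
        raise ValueError(f"not a versioned image name: {image_name!r}")
    # Drop the version so existing plugins keep seeing the generic variant.
    return f"{match['accel']}-{match['base_os']}"


if __name__ == "__main__":
    name = builder_image_name("12.9.1", "ubi9")
    print(name, "->", generic_variant(name))
```

Keeping only major and minor in the name follows the examples in the requirements, where CUDA 12.8.1 maps to cuda-12.8-ubi9.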

      Done - Acceptance Criteria:

      • The build pipeline can concurrently build container images for CUDA 12.9.1 and 12.8.1.
      • Application Base Images for CUDA 12.8 and 12.9 are available and follow the new naming convention.
      • Builder Images for CUDA 12.8 and 12.9 are available and follow the new naming convention.
      • The RHAIIS team can consume the new base images to build their llm-d image (CUDA 12.9.1) and vllm image (CUDA 12.8.1).
      • Validation tests are implemented to ensure that different accelerator versions are not mixed within a single built image.
      • The FROMAGER_VARIANT variable within a versioned image (e.g., cuda-12.9-ubi9) is set to its corresponding generic name (cuda-ubi9).
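
One possible shape for the "no mixed versions" validation is sketched below. It assumes the test already has a mapping from component name to the toolkit version that component was built against; how that mapping is collected from an image is not specified here, and the component names in the usage example are hypothetical.

```python
def major_minor(version: str) -> str:
    """Reduce a full version such as "12.8.1" to its "12.8" prefix."""
    return ".".join(version.split(".")[:2])


def find_mixed_versions(components: dict[str, str], expected: str) -> list[str]:
    """Return the names of components whose toolkit major.minor differs from `expected`.

    An empty list means the image uses a single accelerator version.
    """
    expected_mm = major_minor(expected)
    return sorted(
        name for name, version in components.items()
        if major_minor(version) != expected_mm
    )


if __name__ == "__main__":
    # Hypothetical inventory of an image that is meant to be CUDA 12.8 only.
    inventory = {
        "runtime": "12.8.1",
        "math-libs": "12.8.4",
        "comm-libs": "12.9.0",
    }
    offenders = find_mixed_versions(inventory, expected="12.8.1")
    if offenders:
        print("mixed accelerator versions in:", ", ".join(offenders))
```

Comparing on major.minor rather than the full version is a design choice for this sketch, matching the granularity used in the image names; a stricter check could compare patch levels as well.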

      Use Cases - i.e. User Experience & Workflow:

      The RHAIIS team requires two distinct CUDA versions for their upcoming releases.

      • The team needs to build the llm-d image, which requires CUDA 12.9.1.
      • The team also needs to build the vllm image, which requires CUDA 12.8.1.
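
The two builds above could be expressed as a small build matrix that runs concurrently. The sketch below is hypothetical throughout: `build` is a placeholder that only returns a description instead of invoking a real pipeline, and the `ubi9` base OS is an assumption borrowed from the builder image example.

```python
from concurrent.futures import ThreadPoolExecutor

# Target image -> required CUDA version, taken from the use cases above.
BUILDS = {"llm-d": "12.9.1", "vllm": "12.8.1"}


def image_for(cuda_version: str, base_os: str = "ubi9") -> str:
    """Map a full CUDA version to a version-specific image name."""
    major, minor = cuda_version.split(".")[:2]
    return f"cuda-{major}.{minor}-{base_os}"


def build(target: str, cuda_version: str) -> str:
    """Placeholder build step; a real pipeline would start a container build here."""
    return f"{target}: built on {image_for(cuda_version)} (CUDA {cuda_version})"


def build_all(builds: dict[str, str]) -> dict[str, str]:
    """Run every build in the matrix concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(builds))) as pool:
        futures = {t: pool.submit(build, t, v) for t, v in builds.items()}
        return {t: f.result() for t, f in futures.items()}


if __name__ == "__main__":
    for line in build_all(BUILDS).values():
        print(line)
```

Passing the CUDA version as data rather than hard-coding it in each job is what allows the same pipeline definition to serve both images.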

      Out of Scope:

      • Immediate migration of all existing images.
      • Support for all historical CUDA versions.
      • Support for multiple CUDA versions within a single container image.
      • Changing the names of customer-facing product images for RHAIIS or RHEL AI.

      Documentation Considerations:

      • Document the new version support and End-of-Life (EOL) policy for accelerators.
      • Maintain a clear compatibility matrix to manage the increased complexity of version combinations.

              emacchi@redhat.com Emilien Macchi