Initiative
Resolution: Unresolved
Major
50% To Do, 50% In Progress, 0% Done
XL
Feature title: Support multiple versions of accelerator toolkits
Feature Overview:
This initiative introduces a new strategy for supporting multiple versions of accelerator toolkits (such as CUDA) concurrently within the AI Platform Core Components. Currently, our container images use generic tags (e.g., cuda-ubi9), which prevents us from hosting different toolkit versions simultaneously. This limitation is blocking the RHAIIS team, which requires both CUDA 12.9.1 and 12.8.1 for upcoming releases. By transitioning to version-specific image names (e.g., cuda-12.8-ubi9), we will enable teams to use the accelerator version best suited to their workloads, improving performance and compatibility without resorting to custom, unsustainable build pipelines.
Product(s) associated:
RHAIIS: yes
RHEL AI: yes
RHOAI: yes
Goals:
This feature enables development teams to build and deploy applications using multiple, concurrent versions of accelerator toolkits. The RHAIIS team will be able to build their llm-d and vllm images with the two different CUDA versions required for their upcoming releases. This change moves us from a state where all workloads are forced to use a single accelerator version to a more flexible and sustainable model in which our internal build processes can produce version-specific images on demand, reducing dependency on upstream builds.
Requirements:
- A new naming convention for Application Base Images must be implemented where the accelerator version is in the image name, not the tag (e.g., quay.io/aipcc/base-images/cuda-12.8-el9.6:3.0-175…).
- A new naming convention for Builder Images must be implemented with the format cuda-{major}.{minor}-{base-os} (e.g., cuda-12.8-ubi9).
- Build pipelines must be parameterized to accept CUDA versions as input and modified to build multiple versions concurrently.
- To maintain backward compatibility with existing plugins, the FROMAGER_VARIANT environment variable inside the new images must be set to the generic variant name (e.g., cuda-ubi9 for the cuda-12.9-ubi9 image).
- A process must be established for teams to request a new Quay repository via a Jira EPIC when a new accelerator version is needed.
- The Accelerator Enablement Team is responsible for implementing build pipeline changes and creating new images and variants.
- The Development Platform and Productization teams will support the Accelerator Enablement team if changes are needed in common infrastructure.
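The naming requirements above can be illustrated with a minimal Python sketch. The function names (builder_image_name, generic_variant) and the regular expression are hypothetical and are not part of any existing pipeline; the sketch assumes the toolkit version is given as a dotted numeric string and that the patch component is dropped from the image name.

```python
import re

# Hypothetical pattern for versioned names: {accelerator}-{major}.{minor}-{base-os}
_VERSIONED_NAME = re.compile(
    r"^(?P<accel>[a-z]+)-(?P<major>\d+)\.(?P<minor>\d+)-(?P<base_os>[a-z0-9.]+)$"
)


def builder_image_name(accelerator: str, toolkit_version: str, base_os: str) -> str:
    """Build a versioned image name, keeping only major.minor of the toolkit version."""
    parts = toolkit_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected toolkit version: {toolkit_version!r}")
    return f"{accelerator}-{parts[0]}.{parts[1]}-{base_os}"


def generic_variant(image_name: str) -> str:
    """Derive the generic name (the assumed FROMAGER_VARIANT value) from a versioned name."""
    match = _VERSIONED_NAME.match(image_name)
    if match is None:
        raise ValueError(f"not a versioned image name: {image_name!r}")
    return f"{match['accel']}-{match['base_os']}"


# A parameterized pipeline could expand its version inputs into one name per build.
build_targets = [builder_image_name("cuda", v, "ubi9") for v in ("12.9.1", "12.8.1")]
print(build_targets)                      # ['cuda-12.9-ubi9', 'cuda-12.8-ubi9']
print(generic_variant("cuda-12.9-ubi9"))  # cuda-ubi9
```

Deriving the generic name from the versioned name, rather than storing both, keeps the two from drifting apart when a new toolkit version is added.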
Done - Acceptance Criteria:
- The build pipeline can concurrently build container images for CUDA 12.9.1 and 12.8.1.
- Application Base Images for CUDA 12.8 and 12.9 are available and follow the new naming convention.
- Builder Images for CUDA 12.8 and 12.9 are available and follow the new naming convention.
- The RHAIIS team can consume the new base images to build their llm-d image (CUDA 12.9.1) and vllm image (CUDA 12.8.1).
- Validation tests are implemented to ensure that different accelerator versions are not mixed within a single built image.
- The FROMAGER_VARIANT variable within a versioned image (e.g., cuda-12.9-ubi9) is set to its corresponding generic name (cuda-ubi9).
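One possible shape for the "no mixed versions" validation is sketched below in Python. It assumes a separate, unspecified step has already collected the toolkit versions found inside an image (for example from installed package metadata); the function names and the input format are hypothetical.

```python
import re


def expected_major_minor(image_name: str) -> str:
    """Extract the major.minor toolkit version encoded in a versioned image name."""
    match = re.match(r"^[a-z]+-(\d+\.\d+)-", image_name)
    if match is None:
        raise ValueError(f"not a versioned image name: {image_name!r}")
    return match.group(1)


def find_version_mismatches(image_name: str, detected_versions: list[str]) -> list[str]:
    """Return detected toolkit versions whose major.minor differs from the image name."""
    expected = expected_major_minor(image_name)
    mismatches = set()
    for version in detected_versions:
        major_minor = ".".join(version.split(".")[:2])
        if major_minor != expected:
            mismatches.add(version)
    return sorted(mismatches)


# An image named for CUDA 12.8 that also contains 12.9.1 components fails the check.
print(find_version_mismatches("cuda-12.8-ubi9", ["12.8.1", "12.9.1"]))  # ['12.9.1']
print(find_version_mismatches("cuda-12.8-ubi9", ["12.8.1"]))            # []
```

A test in the pipeline could fail the build whenever the returned list is non-empty.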
Use Cases - i.e. User Experience & Workflow:
The RHAIIS team requires two distinct CUDA versions for their upcoming releases.
- The team needs to build the llm-d image, which requires CUDA 12.9.1.
- The team also needs to build the vllm image, which requires CUDA 12.8.1.
Out of Scope:
- Immediate migration of all existing images.
- Support for all historical CUDA versions.
- Support for multiple CUDA versions within a single container image.
- Changing the names of customer-facing product images for RHAIIS or RHEL AI.
Documentation Considerations:
- Document the new version support and End-of-Life (EOL) policy for accelerators.
- Maintain clear compatibility matrix documentation to manage the increased complexity of version combinations.