Feature
Resolution: Unresolved
rhos-workloads-vaf
Feature Overview
We need to provide software capable of partitioning GPUs into different MIG configurations in a way that is consumable by Compute services such as Nova (near future) and Cyborg (far future). The feature will also provide documentation on how to preconfigure MIG before using vGPU MIG-backed instances in RHOSO. Proper MIG preconfiguration ensures that users can fully leverage MIG-backed vGPUs for their workloads, with optimized resource partitioning and performance.
Goals
- To enable Compute services (such as Nova and Cyborg) to manage MIG configuration and apply it easily, without additional manual pre-steps.
- To provide user documentation on how to use this software within the RHOSO product.
Who benefits from this Feature, and how?
- Customers coming from other virtualization platforms get parity for C-series MIG-backed vGPUs.
- AI/ML users benefit from clear guidance on setting up MIG for optimal GPU utilization.
- RHOSO administrators can correctly configure MIG to avoid misconfigurations and ensure predictable GPU resource allocation.
- Organizations can improve GPU efficiency by following best practices for MIG setup.
- Data scientists can run more inference services by using more MIG slices, each with memory isolation.
What is the difference between today’s current state and a world with this Feature?
- Current State: No official RHOSO documentation exists for preconfiguring MIG before enabling vGPU MIG-backed instances.
- Future State: Users have clear and detailed guidance on how to configure MIG properly before deploying workloads.
Requirements:
| Requirement | Notes | isMVP? |
|---|---|---|
| Ensure appropriate packaging in the product | For example, RPM or container | Yes |
| Integrated compute node installation | In edpm-ansible | Yes |
| Provide RHOSO documentation on enabling and configuring MIG on supported NVIDIA GPUs | Include step-by-step instructions and best practices | Yes |
| Detail how to verify MIG configuration before deploying vGPU-backed workloads | Ensure users can validate MIG setup before proceeding | Yes |
| Explain integration with RHOSO | Cover compatibility and necessary preconditions for MIG use in RHOSO | Yes |
| Provide troubleshooting guidelines for common MIG setup issues | Help users quickly resolve misconfigurations | No |
Done - Acceptance Criteria
- Suitable software/driver is part of the RHOSO product, and the appropriate installation steps are part of the Ansible module(s) of the RHOSO installers.
- User documentation for the MIG management software is part of the downstream documentation.
Use Cases - i.e. User Experience & Workflow:
- A system administrator wants to enable MIG on NVIDIA GPUs before deploying vGPU MIG-backed workloads in RHOSO.
- A user follows the documentation to partition a GPU using MIG and validate the setup. A troubleshooting guide helps users resolve issues when MIG is not detected correctly.
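The enable-and-partition workflow in these use cases could be sketched as follows. This is a minimal illustration, not the planned implementation: the `layout` mapping (GPU index to MIG profile names) and the function name `build_mig_commands` are hypothetical, and the profile names shown are examples that depend on the GPU model. The script only builds `nvidia-smi` argument lists; it does not execute them.

```python
from typing import Dict, List


def build_mig_commands(layout: Dict[int, List[str]]) -> List[List[str]]:
    """Return nvidia-smi argument lists that enable MIG and create instances.

    `layout` maps a GPU index to the MIG profile names wanted on that GPU.
    The mapping format is hypothetical; it is not a RHOSO or Nova option.
    """
    commands: List[List[str]] = []
    for gpu_index in sorted(layout):
        profiles = layout[gpu_index]
        if not profiles:
            continue  # nothing requested for this GPU, leave it untouched
        gpu = str(gpu_index)
        # Switch the GPU into MIG mode (may require a GPU reset or reboot).
        commands.append(["nvidia-smi", "-i", gpu, "-mig", "1"])
        # Create GPU instances and, with -C, their default compute instances.
        commands.append(
            ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"]
        )
    return commands


if __name__ == "__main__":
    # Example profile names only; supported profiles vary by GPU model.
    example_layout = {0: ["3g.20gb", "3g.20gb"], 1: ["1g.5gb"] * 7}
    for argv in build_mig_commands(example_layout):
        print(" ".join(argv))
```

Keeping the layout declarative and generating the commands from it is one possible way to let an installer role or a Compute service apply the same configuration repeatably; whether a requested combination of profiles is actually valid still has to be checked against the GPU itself.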
Out of Scope:
- Installation logic at the OCP operator level, as this will be added as part of the deployment of the specific service that will use the MIG management software/driver.
- Implementation of MIG-backed vGPU support itself.
- Changes to RHOSO scheduling mechanisms for vGPU resources.
Documentation Considerations
- Create a step-by-step guide with screenshots or CLI examples.
- Provide verification commands and expected output for confirming MIG setup.
- Link to NVIDIA official documentation where relevant.
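As an illustration of what a verification step with expected output might look like, the sketch below parses the text printed by `nvidia-smi -L` and compares the MIG devices it lists with an expected layout. The line format assumed by the regular expressions and the `SAMPLE_LISTING` text (including its placeholder UUIDs) are assumptions for demonstration; real output can differ between driver versions, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumed line shapes of `nvidia-smi -L`; adjust if the driver prints differently.
GPU_LINE = re.compile(r"^GPU (\d+):")
MIG_LINE = re.compile(r"^\s+MIG (\S+)\s+Device\s+\d+:")

# Placeholder listing for demonstration only; UUIDs are not real.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.20gb     Device  0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
  MIG 3g.20gb     Device  1: (UUID: MIG-22222222-2222-2222-2222-222222222222)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-33333333-3333-3333-3333-333333333333)
"""


def mig_profiles_by_gpu(listing: str) -> Dict[int, List[str]]:
    """Map each GPU index to the MIG profile names listed beneath it."""
    result: Dict[int, List[str]] = {}
    current = None
    for line in listing.splitlines():
        gpu_match = GPU_LINE.match(line)
        if gpu_match:
            current = int(gpu_match.group(1))
            result[current] = []
            continue
        mig_match = MIG_LINE.match(line)
        if mig_match and current is not None:
            result[current].append(mig_match.group(1))
    return result


def matches_layout(listing: str, expected: Dict[int, List[str]]) -> bool:
    """Return True if every GPU in `expected` shows exactly those profiles."""
    found = mig_profiles_by_gpu(listing)
    return all(
        sorted(found.get(gpu, [])) == sorted(profiles)
        for gpu, profiles in expected.items()
    )


if __name__ == "__main__":
    print(mig_profiles_by_gpu(SAMPLE_LISTING))
    print(matches_layout(SAMPLE_LISTING, {0: ["3g.20gb", "3g.20gb"]}))
```

In a real check, the listing would come from running `nvidia-smi -L` on the compute node; the comparison ignores device order so that only the set of profiles per GPU matters.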
Questions to Answer:
- What are the supported GPU models for MIG preconfiguration in RHOSO?
- Are there any dependencies or prerequisites users need to be aware of before enabling MIG?
- What are the recommended best practices for configuring MIG for AI/ML workloads?
Background and Strategic Fit:
- Ensuring MIG is properly preconfigured is a prerequisite for enabling vGPU MIG-backed support in RHOSO.
- Providing clear documentation prevents user misconfigurations and improves overall GPU resource efficiency.
- Aligns with industry best practices for AI/ML and HPC workload management.
Customer Considerations:
- Customers who rely on NVIDIA MIG for workload partitioning need official guidance on setup.
- Organizations deploying RHOSO in multi-GPU environments benefit from clear instructions to optimize resource allocation.
Team Sign Off (Completion while in Planning status)
- All required Epics (known at the time) are linked to this Feature
- All required Stories, Tasks (known at the time) for the most immediate Epics have been created and estimated
- Add the reviewer's name and team name
- Acceptance == Feature as “Ready” - well understood and scope is clear - Acceptance Criteria (scope) is elaborated, well defined, and understood
- Note: Only set FixVersion/s: on a Feature if the delivery team agrees they have the capacity and have committed that capability for that milestone
| Reviewed By | Team Name | Accepted | Notes |
- …