
    • Feature
    • Resolution: Unresolved
    • VAF
    • rhos-workloads-vaf

      Feature Overview
      We need to provide software that can partition NVIDIA GPUs into different MIG configurations in a way that Compute services such as Nova (near future) and Cyborg (far future) can consume. The feature will also provide documentation on how to preconfigure MIG before using vGPU MIG-backed instances in RHOSO. Proper MIG preconfiguration lets users fully leverage MIG-backed vGPUs for their workloads, with optimized resource partitioning and performance.

      Goals

      • To enable Compute services (such as Nova and Cyborg) to manage MIG configuration and apply it easily, without additional manual pre-steps.
      • To provide documentation that shows users how to use this software within the RHOSO product.

      Who benefits from this Feature, and how?

      • Customers coming from other virtualization platforms gain parity for C-series MIG-backed vGPUs.
      • AI/ML users benefit from clear guidance on setting up MIG for optimal GPU utilization.
      • RHOSO administrators can correctly configure MIG to avoid misconfigurations and ensure predictable GPU resource allocation.
      • Organizations can improve GPU efficiency by following best practices for MIG setup.
      • Data scientists can run more inference services by using more MIG slices, each with memory isolation.

      What is the difference between the current state and a world with this Feature?

      • Current State: No official RHOSO documentation exists for preconfiguring MIG before enabling vGPU MIG-backed instances.
      • Future State: Users have clear and detailed guidance on how to configure MIG properly before deploying workloads.

       

      Requirements:

      Requirement | Notes | isMVP?
      Ensure appropriate packaging in a product such as RPM or container | | Yes
      Integrated compute node installation in edpm-ansible | | Yes
      Provide RHOSO documentation on enabling and configuring MIG on supported NVIDIA GPUs | Include step-by-step instructions and best practices | Yes
      Detail how to verify MIG configuration before deploying vGPU-backed workloads | Ensure users can validate MIG setup before proceeding | Yes
      Explain integration with RHOSO | Cover compatibility and necessary preconditions for MIG use in RHOSO | Yes
      Provide troubleshooting guidelines for common MIG setup issues | Help users quickly resolve misconfigurations | No

       

      Done - Acceptance Criteria 

      • Suitable software/driver is part of the RHOSO product, and the appropriate installation steps are part of the Ansible module(s) of the RHOSO installers.
      • User documentation for the MIG management software is part of the downstream documentation.

      Use Cases - i.e. User Experience & Workflow:

      • A system administrator wants to enable MIG on NVIDIA GPUs before deploying vGPU MIG-backed workloads in RHOSO.
      • A user follows the documentation to partition a GPU using MIG and validate the setup. A troubleshooting guide helps users resolve issues when MIG is not detected correctly.
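
      The administrator workflow in these use cases could be sketched as follows. This is only an illustration, not the software this Feature will deliver: the helper name mig_setup_commands is hypothetical, and the profile IDs are placeholders that differ per GPU model and must be read from the output of nvidia-smi mig -lgip on the target host. The sketch only builds the command lines and does not execute them.

```python
import shlex
from typing import List, Sequence


def mig_setup_commands(gpu_index: int, profile_ids: Sequence[int]) -> List[List[str]]:
    """Build the nvidia-smi calls for one manual MIG setup pass on a single GPU.

    profile_ids are GPU instance profile IDs. They are hardware-specific
    placeholders here; look them up with `nvidia-smi mig -lgip`.
    """
    if not profile_ids:
        raise ValueError("at least one GPU instance profile ID is required")
    gpu = str(gpu_index)
    profiles = ",".join(str(p) for p in profile_ids)
    return [
        # 1. Enable MIG mode on the selected GPU.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # 2. List the GPU instance profiles this GPU offers.
        ["nvidia-smi", "mig", "-i", gpu, "-lgip"],
        # 3. Create GPU instances and their default compute instances (-C).
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", profiles, "-C"],
        # 4. List the GPU instances that were created.
        ["nvidia-smi", "mig", "-i", gpu, "-lgi"],
        # 5. List all devices, including the resulting MIG devices.
        ["nvidia-smi", "-L"],
    ]


if __name__ == "__main__":
    # Profile ID 9 is an assumed example value, not a recommendation.
    for argv in mig_setup_commands(0, [9, 9]):
        print(shlex.join(argv))
```

      These commands need root privileges on the compute node, and a GPU reset or reboot may be required before a MIG mode change takes effect.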

      Out of Scope:

      • Installation logic at the OCP operator level, as this will be added as part of the deployment of the specific service that will use the MIG management software/driver.
      • Implementation of MIG-backed vGPU support itself.
      • Changes to RHOSO scheduling mechanisms for vGPU resources.

      Documentation Considerations

      • Create a step-by-step guide with screenshots or CLI examples.
      • Provide verification commands and expected output for confirming MIG setup.
      • Link to NVIDIA official documentation where relevant.
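
      As one possible shape for the verification step, the sketch below parses the output of nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader and reports which GPUs do not have MIG enabled. The function names and the sample text are hypothetical; the sample is not captured from a real host, and the final documentation should show output from supported hardware.

```python
from typing import Dict, List


def parse_mig_modes(csv_text: str) -> Dict[int, bool]:
    """Map GPU index to True when its current MIG mode is 'Enabled'.

    Expects one 'index, mode' pair per line, as printed by:
      nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
    Any other mode value (for example 'Disabled' or '[N/A]') counts as not enabled.
    """
    modes: Dict[int, bool] = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, _, mode = line.partition(",")
        modes[int(index.strip())] = mode.strip() == "Enabled"
    return modes


def gpus_without_mig(csv_text: str) -> List[int]:
    """Return the sorted indexes of GPUs whose MIG mode is not enabled."""
    return sorted(i for i, enabled in parse_mig_modes(csv_text).items() if not enabled)


if __name__ == "__main__":
    # Hypothetical sample text, used only to exercise the parser.
    sample = "0, Enabled\n1, Disabled\n2, [N/A]\n"
    print(gpus_without_mig(sample))
```

      On a real compute node the input text would come from running the query command, for example through subprocess.run with capture_output=True and text=True.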

      Questions to Answer:

      • What are the supported GPU models for MIG preconfiguration in RHOSO?
      • Are there any dependencies or prerequisites users need to be aware of before enabling MIG?
      • What are the recommended best practices for configuring MIG for AI/ML workloads?

      Background and Strategic Fit:

      • Ensuring MIG is properly preconfigured is a prerequisite for enabling vGPU MIG-backed support in RHOSO.
      • Providing clear documentation prevents user misconfigurations and improves overall GPU resource efficiency.
      • Aligns with industry best practices for AI/ML and HPC workload management.

      Customer Considerations:

      • Customers who rely on NVIDIA MIG for workload partitioning need official guidance on setup.
      • Organizations deploying RHOSO in multi-GPU environments benefit from clear instructions to optimize resource allocation.

      Team Sign Off (Completion while in Planning status)

      • All required Epics (known at the time) are linked to this Feature
      • All required Stories, Tasks (known at the time) for the most immediate Epics have been created and estimated
      • Add - Reviewers name, Team Name
      • Acceptance == Feature as “Ready” - well understood and scope is clear - Acceptance Criteria (scope) is elaborated, well defined, and understood
      • Note: Only set FixVersion/s: on a Feature if the delivery team agrees they have the capacity and have committed that capability for that milestone
      Reviewed By | Team Name | Accepted | Notes
             
             
             
             

       

              mmagr@redhat.com Martin Magr
              Sudhakar Molli
              rhos-workloads-vaf