AI Platform Core Components / AIPCC-10112

Engage IT team to understand MOSAIC and current GPU management in RDU4


    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • Model Validation

      Note: According to Rich Hardy, IT already runs a service called MOSAIC, built on the GPUaaS toolset in OpenShift AI:
      https://source.redhat.com/departments/it/ai_platforms/mosaic_platform 

      Goal

      Build a clear and structured understanding of how IT manages GPUs today in RDU4 and what MOSAIC provides in practice.

       

      Description

      This story focuses on discovery and documentation of the current GPU management solution operated by IT in RDU4, Red Hat’s internal data center in the Raleigh/Durham region.

       

      The work for this story will likely consist mainly of meetings and walkthroughs with the IT team, rather than hands-on technical access or deployment.

       

      The objective is not to evaluate or judge the solution, but to document the current state as-is, including tooling, architecture, and operational practices.

       

      Topics to be covered include:

      • The feature set provided by MOSAIC or any other GPU management tooling in use
      • The high-level architecture and main components involved
      • How the solution helps IT optimize GPU usage and reduce idle resources
      • Whether dashboards or observability tools exist, and what visibility they provide (who uses which GPUs, usage vs utilization)
      • How GPU resources are allocated behind the scenes
      • How quotas, priorities, and fair sharing are enforced
      • How preemption is handled, if at all
      • Whether GPU partitioning mechanisms such as MIG are used, and how
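
      The "usage vs utilization" distinction in the topics above can be made concrete with a small sketch. This is an illustration only: the GpuSample record, its fields, and the sample values are hypothetical and do not describe MOSAIC's actual data model or metrics. Here "usage" means the share of GPUs that are allocated to someone, and "utilization" means how busy those allocated GPUs actually are.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One hypothetical monitoring sample for a single GPU."""
    gpu_id: str
    allocated: bool       # True if the GPU is reserved by a user or workload
    busy_percent: float   # compute activity during the sampling interval, 0-100


def usage_rate(samples: List[GpuSample]) -> float:
    """Fraction of samples in which the GPU was allocated to someone."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.allocated) / len(samples)


def utilization_rate(samples: List[GpuSample]) -> float:
    """Mean busy fraction, computed over allocated samples only."""
    allocated = [s for s in samples if s.allocated]
    if not allocated:
        return 0.0
    return sum(s.busy_percent for s in allocated) / (100.0 * len(allocated))


# Made-up values: three of four GPUs are allocated, but only one is really busy.
samples = [
    GpuSample("gpu-0", True, 90.0),
    GpuSample("gpu-1", True, 10.0),
    GpuSample("gpu-2", True, 0.0),
    GpuSample("gpu-3", False, 0.0),
]

print(f"usage: {usage_rate(samples):.2f}")
print(f"utilization: {utilization_rate(samples):.2f}")
```

      A large gap between the two numbers points to GPUs that are reserved but idle, which is the kind of waste the questions on dashboards and idle-resource reduction are meant to uncover.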

       

      Out of scope

      • Deploying or modifying IT-managed systems
      • Running workloads or stress tests (this may be brought into scope if IT grants us access)
      • Comparing MOSAIC to other GPUaaS candidates

       

      DoD

      • A written summary document exists that includes:
          • A bullet-point list of available features
          • A high-level architecture overview
          • An explanation of how GPU resources are allocated and reclaimed
          • Details on prioritization, preemption, and any use of MIG
          • Information on dashboards and visibility into GPU usage
          • Notes on how the solution saves resources and improves utilization

       

      The document is shared with the team and can be directly referenced in the GPUaaS evaluation and comparison phase.

              wspinks@redhat.com Wesley Spinks
              rh-ee-abadli Aviran Badli