-
Story
-
Resolution: Unresolved
Be aware: Rich Hardy said that IT already offers a service called MOSAIC, built on the GPUaaS toolset in OpenShift AI.
https://source.redhat.com/departments/it/ai_platforms/mosaic_platform
Goal
Build a clear and structured understanding of how IT manages GPUs today in RDU4 and what MOSAIC provides in practice.
Description
This story focuses on discovery and documentation of the current GPU management solution operated by IT in RDU4, Red Hat’s internal data center in the Raleigh/Durham region.
The work for this story may consist primarily of meetings and walkthroughs with the IT team rather than technical access or deployment.
The objective is not to evaluate or judge the solution, but to document the current state as-is, including tooling, architecture, and operational practices.
Topics to be covered include:
- The feature set provided by MOSAIC or any other GPU management tooling in use
- The high-level architecture and main components involved
- How the solution helps IT optimize GPU usage and reduce idle resources
- Whether dashboards or observability tools exist, and what visibility they provide (for example, who uses which GPUs, and usage versus utilization)
- How GPU resources are allocated behind the scenes
- How quotas, priorities, and fair sharing are enforced
- How preemption is handled, if at all
- Whether GPU partitioning mechanisms such as MIG are used, and how
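To make the "usage versus utilization" distinction in the topics above concrete, here is a minimal Python sketch. It is illustrative only: the `GpuSample` fields, the 5% idle threshold, and the sample data are assumptions made for this example and do not describe how MOSAIC or any IT dashboard actually reports GPU metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuSample:
    """One GPU observed over a sampling window (hypothetical schema)."""
    gpu_id: str
    owner: Optional[str]  # None means the GPU is not allocated to anyone
    busy_pct: float       # assumed compute-busy percentage, 0 to 100


def summarize(samples: List[GpuSample], idle_threshold_pct: float = 5.0) -> dict:
    """Return usage (share of GPUs allocated) and utilization
    (average busy fraction of the allocated GPUs)."""
    allocated = [s for s in samples if s.owner is not None]
    usage = len(allocated) / len(samples) if samples else 0.0
    utilization = (
        sum(s.busy_pct for s in allocated) / (100.0 * len(allocated))
        if allocated
        else 0.0
    )
    # Allocated but nearly idle GPUs are candidates for reclamation.
    idle_allocated = [s.gpu_id for s in allocated if s.busy_pct < idle_threshold_pct]
    return {
        "usage": usage,
        "utilization": utilization,
        "idle_allocated": idle_allocated,
    }


# Made-up sample data, not real RDU4 measurements.
samples = [
    GpuSample("gpu-a", "team-a", 80.0),
    GpuSample("gpu-b", "team-a", 0.0),
    GpuSample("gpu-c", "team-b", 40.0),
    GpuSample("gpu-d", None, 0.0),
]
report = summarize(samples)
```

With this made-up data, three of four GPUs are allocated (usage 0.75), but the allocated GPUs are busy only 40% of the time on average, and `gpu-b` is allocated yet idle. That gap is the kind of visibility the dashboard question is meant to uncover.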
Out of scope
- Deploying or modifying IT-managed systems
- Running workloads or stress tests (this could become part of the story if IT gives us access)
- Comparing MOSAIC to other GPUaaS candidates
DoD
- A written summary document exists that includes:
  - A bullet-point list of available features
  - A high-level architecture overview
  - An explanation of how GPU resources are allocated and reclaimed
  - Details on prioritization, preemption, and any use of MIG
  - Information on dashboards and visibility into GPU usage
  - Notes on how the solution saves resources and improves utilization
The document is shared with the team and can be directly referenced in the GPUaaS evaluation and comparison phase.