Story
Resolution: Unresolved
Goal
Build a clear and well-documented understanding of how GPU resources are managed on IBM Cloud today.
Description
This story focuses on discovery and documentation of GPU resource management practices on IBM Cloud, through direct coordination with the IBM Cloud team.
The work is not expected to involve hands-on technical access or deployment.
The primary activity is expected to be one or more working sessions with the IBM Cloud team to understand their current architecture, tooling, and operational model.
At least two participants must work on this story together, to ensure shared understanding, avoid relying on a single person's interpretation, and improve the quality of the output.
Primary contact: Kieran Forde
Topics to be covered include:
- How GPU resources are provisioned and managed on IBM Cloud
- The high-level architecture used for GPU allocation and scheduling
- What abstractions or services exist for GPU consumers
- How GPU usage and utilization are tracked
- Whether dashboards or observability tools exist, and what visibility they provide
- How resources are allocated behind the scenes
- How quotas, priorities, and fairness are handled
- Whether preemption is supported, and under what conditions
- Whether GPU partitioning mechanisms such as MIG are used
The goal is to document the as-is state, not to evaluate or compare solutions at this stage.
Out of scope
- Deploying workloads on IBM Cloud
- Running GPU benchmarks or stress tests
- Comparing IBM Cloud to other GPUaaS candidates
DoD
- A meeting (or series of meetings) with the IBM Cloud team is completed
- At least two team members participated in the sessions
- A written summary document exists that includes:
  - A bullet-point list of available GPU management features
  - A high-level architecture overview
  - How GPU resources are allocated, shared, and reclaimed
  - How prioritization, quotas, preemption, and MIG (if applicable) are handled
  - What dashboards or visibility exist for usage and utilization
- The document is shared with the team and can be directly referenced in the GPUaaS evaluation and comparison phase.