-
Epic
-
Resolution: Unresolved
DRA: e2e test suite that validates Nvidia GPU
-
In Progress
-
Product / Portfolio Work
36% To Do, 14% In Progress, 50% Done
This is the follow-up work required to get the DRA e2e suite running in OpenShift CI as a periodic job.
Goal:
- a periodic job in CI that provisions a cluster with a GPU worker node and runs the e2e suite
Non Goal:
- Our focus is limited to validating workloads with Nvidia GPUs (using the Nvidia DRA driver). We will not add support for any other vendor.
The e2e suite is being developed in https://github.com/openshift/origin/pull/29842 . It currently covers the following use cases:
- a common test spec (one pod, one container asking for a distinct GPU) that can be validated against both the example DRA driver and the Nvidia DRA driver. The spec is expected to pass on both
- two containers, each asking for a distinct GPU; one container should not have access to the other's GPU
- MPS strategy
- TimeSlicing strategy
- static pre-partitioned MIG slices
- IPC using CUDA
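The isolation check in the two-container case can be illustrated with a minimal Python sketch. It is not code from the suite: it assumes the test captures the output of `nvidia-smi -L` from each container, and the helper names `visible_gpu_uuids` and `check_distinct_gpus` are hypothetical. It only handles full GPUs; MIG slices report additional `MIG-` UUIDs and would need separate handling.

```python
import re
from typing import Dict, Set

# Matches the UUID field of a full GPU in `nvidia-smi -L` output, e.g.
#   GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-11111111-2222-...)
GPU_UUID_RE = re.compile(r"\(UUID:\s*(GPU-[0-9A-Fa-f-]+)\)")


def visible_gpu_uuids(nvidia_smi_output: str) -> Set[str]:
    """Return the set of full-GPU UUIDs listed in `nvidia-smi -L` output."""
    return set(GPU_UUID_RE.findall(nvidia_smi_output))


def check_distinct_gpus(outputs: Dict[str, str]) -> None:
    """Raise AssertionError unless each container sees exactly one GPU
    and no GPU is visible from more than one container.

    `outputs` maps a container name (hypothetical, chosen by the caller)
    to the `nvidia-smi -L` output captured inside that container.
    """
    owner_by_uuid: Dict[str, str] = {}
    for container, output in outputs.items():
        uuids = visible_gpu_uuids(output)
        if len(uuids) != 1:
            raise AssertionError(
                f"{container}: expected exactly 1 GPU, saw {sorted(uuids)}"
            )
        uuid = next(iter(uuids))
        if uuid in owner_by_uuid:
            raise AssertionError(
                f"{uuid} is visible in both {owner_by_uuid[uuid]} and {container}"
            )
        owner_by_uuid[uuid] = container
```

A test would call `check_distinct_gpus({"ctr0": out0, "ctr1": out1})` with the two captured outputs; the call returns silently when each container sees a different single GPU.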
Constraints:
- The Nvidia DRA driver is not part of the GPU operator yet. For now, we install the Nvidia DRA driver using Helm from Nvidia's official repo: https://catalog.ngc.nvidia.com/orgs/nvidia/helm-charts/nvidia-dra-driver-gpu
- The integration of the DRA driver into the GPU operator is being done in https://github.com/NVIDIA/gpu-operator/pull/1541 . Once the GPU operator adds the DRA driver as an operand, we can install the driver using the ClusterPolicy API of the operator