Story
Resolution: Unresolved
Creating MIG slices[2] can sometimes fail[1], for example while fetching the GI profiles through NVML library calls.
These calls can time out or return errors, possibly because of the health of the underlying device.
Such failures should be retried, and the status of the corresponding node accelerator object should be updated to reflect the outcome.
This status information could also inform the scheduling of new workloads: a node with faulty devices can be tainted so that the scheduler does not place new workloads on those devices.
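The tainting decision could be derived from the recorded device health roughly as follows. This is only a sketch: the `Taint` struct merely mirrors the key/value/effect shape of a Kubernetes node taint, the key `example.com/gpu-unhealthy` is a made-up placeholder, and the per-device health map is an assumed input rather than an existing field of the accelerator object.

```go
package main

import "fmt"

// Taint mirrors the key/value/effect shape of a Kubernetes node taint.
type Taint struct {
	Key, Value, Effect string
}

// faultyGPUTaintKey is a hypothetical taint key used for illustration only.
const faultyGPUTaintKey = "example.com/gpu-unhealthy"

// taintFor returns a NoSchedule taint and true if at least one device on the
// node is reported unhealthy; otherwise it returns a zero Taint and false.
func taintFor(deviceHealthy map[string]bool) (Taint, bool) {
	for _, healthy := range deviceHealthy {
		if !healthy {
			return Taint{Key: faultyGPUTaintKey, Value: "true", Effect: "NoSchedule"}, true
		}
	}
	return Taint{}, false
}

func main() {
	taint, needed := taintFor(map[string]bool{"GPU-0": true, "GPU-1": false})
	fmt.Println(needed, taint.Effect)
}
```

Because a taint applies to the whole node, this approach also keeps workloads away from the node's healthy devices; whether that trade-off is acceptable is a design decision for the story.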
[1] - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/68140/rehearse-68140-pull-ci-openshift-instaslice-operator-next-e2e-gpu-4-19/1957343396754362368
[2] - https://github.com/openshift/release/pull/68140#issuecomment-3196839984