Type: Story
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Activity Type:
None
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
Dynamic Accelerator Slicer Operator TP 2
Story Points:
None

Target Version:
None
Release Blocker:
None
Sprint:
None

Creating MIG slices[2] sometimes fails[1] during NVML library calls, for example while fetching the GI profiles.
The calls can time out or return an error, possibly because of the health of the underlying device.

There should be retry logic for such failures, and the status of the corresponding node accelerator object should be updated to reflect the outcome.
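
A minimal sketch of the retry behaviour described above, written in Python for brevity rather than in the operator's own language. It retries a failing device call with exponential backoff and records the outcome in a status record. `DeviceCallError`, `DeviceStatus`, and `call_with_retry` are hypothetical names chosen for illustration; they are not part of NVML or of the operator's real status schema, and the attempt count and delays are assumed defaults.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DeviceCallError(Exception):
    """Hypothetical stand-in for a timeout or error from a device library call."""


@dataclass
class DeviceStatus:
    """Hypothetical status record, not the operator's real status schema."""
    healthy: bool = True
    attempts: int = 0
    last_error: Optional[str] = None


def call_with_retry(
    call: Callable[[], T],
    status: DeviceStatus,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run `call`, retrying on DeviceCallError with exponential backoff.

    Updates `status` after every attempt. Returns the call's result, or
    None once all attempts have failed (status.healthy is then False).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status.attempts = attempt
        try:
            result = call()
        except DeviceCallError as exc:
            status.last_error = str(exc)
            if attempt == max_attempts:
                # Out of retries: mark the device as unhealthy.
                status.healthy = False
                return None
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
        else:
            status.healthy = True
            status.last_error = None
            return result
    return None
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without real waiting. In a real controller the status record would be written back to the node accelerator object instead of being held in memory.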

This status information could also inform the scheduling of new workloads. For example, a node with faulty devices can be tainted so that the scheduler does not place new workloads on those devices.
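
As a rough illustration of the tainting idea, the Python sketch below derives the taints a node should carry from per-device health. The taint key `example.com/faulty-accelerator` and the function `desired_taints` are hypothetical; only the key/value/effect shape and the `NoSchedule` effect come from Kubernetes itself. It also assumes that a single faulty device is enough to taint the whole node, which is one possible policy and not a decision stated in this story.

```python
from typing import Dict, List

# Hypothetical taint key, chosen for illustration only.
FAULTY_DEVICE_TAINT_KEY = "example.com/faulty-accelerator"


def desired_taints(device_health: Dict[str, bool]) -> List[Dict[str, str]]:
    """Return the taints a node should carry given per-device health.

    `device_health` maps a device identifier to True (healthy) or
    False (faulty). An empty list means the node needs no taint.
    """
    has_faulty_device = any(not healthy for healthy in device_health.values())
    if not has_faulty_device:
        return []
    # NoSchedule keeps new pods off the node unless they tolerate the taint.
    return [
        {
            "key": FAULTY_DEVICE_TAINT_KEY,
            "value": "true",
            "effect": "NoSchedule",
        }
    ]
```

A node-level taint keeps new workloads off every device on the node, including healthy ones. Excluding only the faulty device would need a finer-grained mechanism than a taint.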

[1] - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/68140/rehearse-68140-pull-ci-openshift-instaslice-operator-next-e2e-gpu-4-19/1957343396754362368
[2] - https://github.com/openshift/release/pull/68140#issuecomment-3196839984

Assignee: Sai Ramesh Vanka

Reporter: Sai Ramesh Vanka

Need Info From: None

Contributors: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 1

Created: 2025/08/20 4:35 AM

Updated: 2025/08/21 1:36 PM
