OpenShift Node / OCPNODE-3590

Update the node accelerator status conditions based on device health


    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined

      Creating the MIG slices[2] can occasionally fail[1], for example while fetching the GI profiles through the NVML library calls.
      These calls can time out or return an error, possibly because of the health of the underlying device.

      Such failures should be retried, and the status of the corresponding node accelerator object should be updated to reflect the outcome.

      This status information could also inform the scheduling of new workloads: a node with faulty devices can be tainted so that the scheduler does not place new workloads on those devices.

       

      [1] - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/68140/rehearse-68140-pull-ci-openshift-instaslice-operator-next-e2e-gpu-4-19/1957343396754362368 
      [2] - https://github.com/openshift/release/pull/68140#issuecomment-3196839984 

              svanka@redhat.com Sai Ramesh Vanka