Red Hat OpenShift Data Science / RHODS-5887

"Failed to fetch GPU, something went wrong" in RHODS dashboard (1.19.0-14)


      Release Note Text:
      == Incorrect number of available GPUs was displayed in Jupyter
      Previously, when a user attempted to create a notebook instance in Jupyter, the maximum number of GPUs available for scheduling was not updated as GPUs were assigned. Jupyter now displays the correct number of available GPUs.
      Release Note Type: Bug Fix
      Fix Version: RHODS 1.20

      Description of problem:

      In the latest RHODS 1.19 RC (1.19.0-14), the spawner page shows an error popup every time it loads if there is a provisioned GPU node on the cluster.

      This happens both with a node provisioned directly through its own machine pool and with a node provisioned by an autoscaler.
      The error also disables the GPU dropdown (in the default "autodetect" setting), making it impossible to request a GPU by default.

      The attached pod logs from one of the dashboard pods (rhods-dashboard-c79597765-brvhh-rhods-dashboard.log) show that the Prometheus request made by the dashboard to get the number of GPUs returns the OpenShift login page instead of a metrics response.
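      For reference, a minimal sketch of issuing the same kind of GPU-count query by hand, assuming the dashboard goes through the cluster monitoring stack's thanos-querier route; the exact PromQL expression the dashboard uses may differ, and kube_node_status_capacity with resource="nvidia_com_gpu" is only a stand-in for it:

        # Query Prometheus for total GPU capacity across nodes.
        TOKEN=$(oc whoami -t)
        HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
        # A healthy response is JSON; in this bug the dashboard gets the OpenShift login page instead.
        curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
          --data-urlencode 'query=sum(kube_node_status_capacity{resource="nvidia_com_gpu"})'

        # Cross-check against the GPU capacity the nodes themselves advertise.
        oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'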

      There seem to be two workarounds to spawn a server on a GPU node; see the Workaround section below.

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.19.0-14

      OCP 4.10 (the latest version of 4.11, i.e. 4.11.12 at the time of writing, is incompatible with the nvidia gpu addon)

       

      Steps to Reproduce:

      1. Install RHODS 1.19.0-14
      2. Provision a GPU node or an autoscaling machine pool with min >= 1 (see the provisioning sketch after this list)
      3. Install the GPU addon
      4. Visit the RHODS spawner page
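      For step 2, a sketch of provisioning an autoscaling GPU machine pool on a ROSA/OSD cluster; the cluster name, pool name, and instance type below are illustrative assumptions, and on self-managed OCP a MachineSet plus MachineAutoscaler achieves the same:

        rosa create machinepool --cluster=my-cluster \
          --name=gpu-pool \
          --instance-type=g4dn.xlarge \
          --enable-autoscaling --min-replicas=1 --max-replicas=2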

      Actual results:

      An error popup is shown: "Failed to fetch GPU, something went wrong".
      The GPU dropdown is hidden.

      Expected results:

      No error popup on page load; the GPU dropdown correctly shows the maximum number of GPUs that can be requested.

      Reproducibility (Always/Intermittent/Only Once):

      Always on one cluster

      Build Details:

      RHODS 1.19.0-14 on the latest OCP 4.10

      Workaround:

      There seem to be two workarounds to spawn a server on a GPU node (sketches for both follow this list):

      1. Create an autoscaler with a minimum of 0 nodes
        1. Once the spawner page shows the option to request a GPU, try spawning a server
        2. This should trigger an autoscale request, provisioning GPU nodes
        3. Once the node is provisioned and the GPU addon has installed the CUDA driver, the server pod should be scheduled on the node
          1. After this point other users won't be able to request GPUs: the node has now been provisioned, so the spawner will show the error and hide the dropdown on each page load
        4. From the spawner modal, choose to open the server in the same/another tab, which should redirect to the server with the GPU attached
      2. Use the gpuSetting field in the OdhDashboardConfig CR
        1. Set gpuSetting in the CR to '1' (or the maximum number of GPUs in your nodes)
        2. Provision one or more nodes with that same number of GPUs per node
        3. Once the node is running and labeled by the nvidia gpu addon, start a spawn request with 1 or more GPUs attached
        4. OpenShift should correctly place the server pod on the GPU node.
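      A sketch of workaround 1, using OpenShift's standard MachineAutoscaler API; the MachineSet name below is an illustrative assumption, and scale-from-zero support varies by platform and OCP version:

        # gpu-autoscaler.yaml
        apiVersion: autoscaling.openshift.io/v1beta1
        kind: MachineAutoscaler
        metadata:
          name: gpu-machineset-autoscaler    # illustrative name
          namespace: openshift-machine-api
        spec:
          minReplicas: 0                     # start with zero GPU nodes, per the workaround
          maxReplicas: 1
          scaleTargetRef:
            apiVersion: machine.openshift.io/v1beta1
            kind: MachineSet
            name: my-gpu-machineset          # assumed: an existing GPU-backed MachineSet

        oc apply -f gpu-autoscaler.yaml

      And a sketch of workaround 2, assuming the default RHODS install (CR odh-dashboard-config in namespace redhat-ods-applications) and that gpuSetting lives under spec.notebookController; the field path may differ between releases:

        # Pin the spawner's GPU dropdown to a fixed maximum of 1 instead of autodetect.
        oc patch odhdashboardconfig odh-dashboard-config \
          -n redhat-ods-applications --type=merge \
          -p '{"spec":{"notebookController":{"gpuSetting":"1"}}}'

        # Check that the node has been labeled by the GPU addon before spawning
        # (nvidia.com/gpu.present is set by GPU feature discovery; the label may vary by addon version).
        oc get nodes -l nvidia.com/gpu.present=true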

      Additional info:

        1. image-2022-11-11-16-09-46-261.png (221 kB, Luca Giorgi)
        2. image-2022-11-21-12-51-28-259.png (43 kB, Luca Giorgi)
        3. image-2022-11-21-12-53-05-401.png (43 kB, Luca Giorgi)
        4. image-2022-11-21-14-43-18-600.png (28 kB, Luca Giorgi)
        5. image-2022-11-21-14-54-33-937.png (27 kB, Luca Giorgi)
        6. image-2022-11-21-15-56-30-373.png (289 kB, Luca Giorgi)
        7. image-2022-11-21-16-00-30-226.png (11 kB, Luca Giorgi)
        8. image-2022-11-21-16-00-34-556.png (12 kB, Luca Giorgi)
        9. image-2022-11-21-16-00-40-008.png (12 kB, Luca Giorgi)
        10. image-2022-11-21-16-00-44-660.png (14 kB, Luca Giorgi)
        11. rhods-dashboard-c79597765-brvhh-rhods-dashboard.log (767 kB, Luca Giorgi)

      Assignee: Landon LaSmith (llasmith@redhat.com)
      Reporter: Luca Giorgi (rhn-support-lgiorgi)