Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: RHODS_1.19.0_GA
Affects Version/s: None
Component/s: Integrations
Labels:
- eng
- groomed

Blocked: False
Blocked Reason: None
Ready: False
Affects: Release Notes
Affects Testing: Testable
Automated: Yes
Regression: No
Release Note Text:

== Incorrect number of available GPUs was displayed in Jupyter
When a user attempted to create a notebook instance in Jupyter, the maximum number of GPUs available for scheduling was not updated as GPUs were assigned. Jupyter now displays the correct number of GPUs available.
Release Note Type: Bug Fix
Target Release: RHODS_1.19.0_GA
Test Blocker: No
Test Coverage: Yes
Watchlist Impact: None
Git Pull Request: https://github.com/opendatahub-io/odh-dashboard/pull/786, https://github.com/red-hat-data-services/odh-dashboard/pull/250
Sprint: RHODS 1.20

Description of problem:

In the latest RHODS 1.19 RC (1.19.0-14), the spawner page shows an error popup every time the page is loaded if there is a provisioned GPU node on the cluster.

This happens both with a node provisioned directly through its own machine pool and with a node provisioned by an autoscaler.
The error also disables the GPU dropdown (in the default "autodetect" mode), making it impossible to request a GPU by default.

The logs from one of the dashboard pods are attached as rhods-dashboard-c79597765-brvhh-rhods-dashboard.log.
They show that the Prometheus request made by the dashboard to get the number of GPUs receives the OpenShift login page as its response.
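
To illustrate the failure mode, here is a minimal Python sketch of a client that refuses to treat a login page as a query result. It is not the dashboard's code: gpu_count_from_response and UnexpectedResponse are hypothetical names, and the only thing taken from Prometheus is the general shape of an instant-query JSON response ("status" plus "data.result" samples).

```python
import json


class UnexpectedResponse(Exception):
    """Raised when the body is not a successful Prometheus query result."""


def gpu_count_from_response(content_type: str, body: str) -> int:
    """Sum the sample values of an instant-vector query response."""
    # A proxy that rejects the request may answer with an HTML login page,
    # so check the content type before trying to decode JSON.
    if "application/json" not in content_type.lower():
        raise UnexpectedResponse(f"expected JSON, got {content_type!r}")
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise UnexpectedResponse(f"query failed: {payload.get('error')}")
    # Each sample is reported as [timestamp, "value"].
    samples = payload["data"]["result"]
    return sum(int(float(sample["value"][1])) for sample in samples)


# Hypothetical sample payloads, not taken from the attached log.
ok_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1668179200, "2"]}],
    },
})
print(gpu_count_from_response("application/json", ok_body))

try:
    gpu_count_from_response("text/html; charset=utf-8", "<html>Log in</html>")
except UnexpectedResponse as err:
    print("rejected:", err)
```

Surfacing the HTML case as its own error would make the log line point at authentication rather than at GPU detection.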

There seem to be two workarounds to spawn a server on a GPU node: creating an autoscaler with a minimum of 0 nodes, or using the gpuSetting field in the OdhDashboardConfig CR. The steps for both are listed under Workaround.

Prerequisites (if any, like setup, operators/versions):

RHODS 1.19.0-14

OCP 4.10 (the latest version of 4.11, i.e. 4.11.12 at the time of writing, is incompatible with the NVIDIA GPU addon)

Steps to Reproduce

1. Install RHODS 1.19.0-14.
2. Provision a GPU node, or an autoscaling machine pool with a minimum of at least 1 node.
3. Install the GPU addon.
4. Visit the RHODS spawner page.

Actual results:

An error popup is shown: "Failed to fetch GPU, something went wrong".
The GPU dropdown is hidden.

Expected results:

No error popup on page load, and the GPU dropdown correctly shows the maximum number of GPUs that can be requested.
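
As a rough illustration of "the maximum number of GPUs that can be requested", the sketch below assumes that a server pod is scheduled on a single node, so the limit is the largest number of unassigned GPUs on any one node. This is an assumption made for the example, not a description of how the dashboard computes the value; GpuNode and max_requestable_gpus are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GpuNode:
    name: str
    allocatable: int  # GPUs the node exposes for scheduling
    requested: int    # GPUs already assigned to running pods


def max_requestable_gpus(nodes: Iterable[GpuNode]) -> int:
    """Largest number of free GPUs on any single node (0 if there are none)."""
    # A pod cannot span nodes, so take the best single node rather than
    # the cluster-wide total of free GPUs.
    free_per_node = (max(n.allocatable - n.requested, 0) for n in nodes)
    return max(free_per_node, default=0)


# Hypothetical cluster: one partially used node and one fully used node.
cluster = [
    GpuNode("gpu-node-a", allocatable=2, requested=1),
    GpuNode("gpu-node-b", allocatable=4, requested=4),
]
print(max_requestable_gpus(cluster))
```

Under this assumption the value shrinks as GPUs are assigned, which matches the behavior described in the release note text.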

Reproducibility (Always/Intermittent/Only Once):

Always on one cluster

Build Details:

RHODS 1.19.0-14 on the latest OCP 4.10

Workaround:

There seem to be two workarounds to spawn a server on a GPU node:

1. Create an autoscaler with a minimum of 0 nodes:
  1. Once the spawner page shows the option to request a GPU, try spawning a server.
  2. This should trigger an autoscale request, provisioning GPU nodes.
  3. Once the node is provisioned and the GPU addon has installed the CUDA driver, the server pod should be scheduled on the node.
    1. After this point, other users won't be able to request GPUs: since the node has been provisioned, the spawner will start showing the error and hiding the dropdown on each page load.
  4. From the spawner modal, choose to open the server in the same tab or another tab, which should redirect to the server with the GPU attached.
2. Use the gpuSetting field in the OdhDashboardConfig CR:
  1. Set gpuSetting in the CR to '1' (or to the maximum number of GPUs in your nodes).
  2. Provision one or more nodes with the same number of GPUs per node.
  3. Once the node is running and labeled by the NVIDIA GPU addon, start a spawn request with 1 or more GPUs attached.
  4. OpenShift should correctly place the server pod on the GPU node.
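
For the second workaround, the change to the CR can be expressed as a JSON merge patch. The sketch below only builds the patch body. The location of gpuSetting under spec.notebookController is an assumption and should be checked against the OdhDashboardConfig CR on the cluster; gpu_setting_patch is a hypothetical helper name.

```python
import json


def gpu_setting_patch(max_gpus: int) -> str:
    """Build a JSON merge patch that pins gpuSetting to a fixed GPU count."""
    if max_gpus < 1:
        raise ValueError("max_gpus must be at least 1")
    # ASSUMPTION: gpuSetting lives under spec.notebookController.
    # The value is written as a string, matching the '1' used in the steps above.
    patch = {"spec": {"notebookController": {"gpuSetting": str(max_gpus)}}}
    return json.dumps(patch)


print(gpu_setting_patch(1))
```

The resulting string could then be applied with a merge patch, for example through oc patch with --type merge, against whichever OdhDashboardConfig resource the cluster uses.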

Additional info:

Attachments:
image-2022-11-11-16-09-46-261.png (221 kB, 2022/11/11 3:09 PM)
image-2022-11-21-12-51-28-259.png (43 kB, 2022/11/21 11:51 AM)
image-2022-11-21-12-53-05-401.png (43 kB, 2022/11/21 11:53 AM)
image-2022-11-21-14-43-18-600.png (28 kB, 2022/11/21 1:43 PM)
image-2022-11-21-14-54-33-937.png (27 kB, 2022/11/21 1:54 PM)
image-2022-11-21-15-56-30-373.png (289 kB, 2022/11/21 2:56 PM)
image-2022-11-21-16-00-30-226.png (11 kB, 2022/11/21 3:00 PM)
image-2022-11-21-16-00-34-556.png (12 kB, 2022/11/21 3:00 PM)
image-2022-11-21-16-00-40-008.png (12 kB, 2022/11/21 3:00 PM)
image-2022-11-21-16-00-44-660.png (14 kB, 2022/11/21 3:00 PM)
rhods-dashboard-c79597765-brvhh-rhods-dashboard.log (767 kB, 2022/11/11 3:14 PM)

Assignee: Landon LaSmith
Reporter: Luca Giorgi
Votes: 0
Watchers: 10
Created: 2022/11/11 3:26 PM
Updated: 2023/02/17 8:21 PM
Resolved: 2022/11/22 3:42 PM
