  1. Red Hat OpenShift Data Science
  2. RHODS-3182

GPU dropdown shows 1 GPU available even if already in use

Details

    • 3
    • False
    • False
    • Release Notes
    • No
    • 1.10.0-6
    • No
    • An incorrect number of available GPUs was displayed in JupyterHub:: When a user attempted to create a notebook instance in JupyterHub, the maximum number of GPUs available for scheduling was not updated as GPUs were assigned. As a result, there was a delay if the user requested a GPU that was already assigned.
    • Documented as Resolved Issue
    • No
    • Yes
    • None
    • MODH Sprint 1.9, MODH Sprint 1.10

    Description

      Description of problem:

      When GPUs are enabled and the JupyterHub (JH) spawner shows the GPU selection dropdown, the number of GPUs that can be requested does not decrease as GPUs are assigned.

      If the cluster has 1 GPU available and user 1 spawns a server with 1 GPU attached, user 2 still sees 1 GPU available in the spawner. Furthermore, if user 2 tries to spawn a server while requesting 1 GPU, they are stuck waiting until either the JH timeout (10 minutes) expires or user 1 stops their server and releases the GPU.
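As an illustration of the count the dropdown would need, here is a minimal sketch that derives the largest GPU request a single node can still satisfy. It works on plain dictionaries shaped like Kubernetes Node and Pod objects. The function names are hypothetical, and the sketch assumes GPUs are exposed as the `nvidia.com/gpu` extended resource and requested through container limits. It is not the fix that shipped in RHODS.

```python
from typing import Any

# Assumption: GPUs are advertised under this extended resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_requested(pod: dict[str, Any]) -> int:
    """Sum the GPU limits of all containers in one pod object."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get(GPU_RESOURCE, 0))
    return total


def max_schedulable_gpus(nodes: list[dict[str, Any]],
                         pods: list[dict[str, Any]]) -> int:
    """Largest GPU count that one node can still provide (hypothetical helper)."""
    in_use: dict[str, int] = {}
    for pod in pods:
        node_name = pod.get("spec", {}).get("nodeName")
        phase = pod.get("status", {}).get("phase")
        # Only pods bound to a node and not yet finished hold GPUs.
        if node_name and phase not in ("Succeeded", "Failed"):
            in_use[node_name] = in_use.get(node_name, 0) + gpus_requested(pod)

    best = 0  # never report a negative number
    for node in nodes:
        name = node["metadata"]["name"]
        allocatable = int(
            node.get("status", {}).get("allocatable", {}).get(GPU_RESOURCE, 0)
        )
        best = max(best, allocatable - in_use.get(name, 0))
    return best


# The scenario from this report: one GPU node, user 1 already holds the GPU.
nodes = [{"metadata": {"name": "gpu-node"},
          "status": {"allocatable": {GPU_RESOURCE: "1"}}}]
user1_pod = {
    "spec": {"nodeName": "gpu-node",
             "containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}]},
    "status": {"phase": "Running"},
}
print(max_schedulable_gpus(nodes, [user1_pod]))  # prints 0
```

The sketch takes the maximum per node rather than a cluster-wide total because a single notebook pod cannot span nodes.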

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.7.0-5 on OSD running OCP 4.10; GPU operator installed; at least 1 GPU node provisioned on the cluster

      Steps to Reproduce

      1. Log in as user 1.
      2. Spawn a notebook server with 1 GPU attached.
      3. Log out without stopping the server.
      4. Log in as user 2.
      5. Try to spawn a notebook server requesting 1 GPU.

      Actual results:

      User 2 can request 1 GPU, but the server is not spawned because no GPU is available. Once the 10-minute timeout passes, the spawning process fails.
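While user 2 waits, the notebook pod normally sits in the Pending phase with a `PodScheduled` condition whose status is `False` and whose reason is `Unschedulable`. The sketch below shows one way to detect that state from a pod object represented as a plain dictionary. The function name and the sample message are illustrative assumptions, not taken from this issue.

```python
from typing import Any, Optional


def unschedulable_message(pod: dict[str, Any]) -> Optional[str]:
    """Return the scheduler's message if the pod cannot be placed, else None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "")
    return None


# Illustrative pod status; the message text is an assumed example.
pending_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/1 nodes are available: 1 Insufficient nvidia.com/gpu.",
        }],
    }
}
print(unschedulable_message(pending_pod))
```

A spawner could use a check like this to report the problem early instead of waiting for the full timeout.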

      Expected results:

      User 2 should not see any GPUs available if the GPU is already attached to user 1's server.
      When spawning the server, the user should not be stuck waiting for 10 minutes.
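The expected behavior can be sketched as two small helpers: one builds the dropdown choices from the number of GPUs that can currently be scheduled, and one rejects a request that exceeds it before any server is started. Both names are hypothetical, and the sketch assumes the spawner already knows the current schedulable GPU count.

```python
def gpu_dropdown_options(max_schedulable: int) -> list[int]:
    """Choices to offer in the spawner: 0 up to the schedulable maximum."""
    return list(range(max(max_schedulable, 0) + 1))


def validate_gpu_request(requested: int, max_schedulable: int) -> None:
    """Fail immediately instead of letting the spawn wait for a timeout."""
    if requested not in gpu_dropdown_options(max_schedulable):
        raise ValueError(
            f"requested {requested} GPU(s), but at most "
            f"{max(max_schedulable, 0)} can be scheduled right now"
        )


# With the only GPU already attached to user 1's server:
print(gpu_dropdown_options(0))  # prints [0]
```

With this check, user 2 in the scenario above would get an immediate error for a 1-GPU request rather than a 10-minute wait.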

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.7.0-5 on OCP 4.10 rc7

      Workaround:

      No real workaround. User 1 can unblock user 2 by stopping their own server, which frees up the GPU.

      Additional info:

          People

            vhire Vaishnavi Hire
            rhn-support-lgiorgi Luca Giorgi