- Bug
- Resolution: Done
- Critical
- Release Notes
- 1.10.0-6
- Documented as Resolved Issue
- MODH Sprint 1.9, MODH Sprint 1.10
Description of problem:
When GPUs are enabled and the JupyterHub (JH) spawner shows the GPU selection dropdown, the number of GPUs that can be requested does not decrease as GPUs are assigned to running servers.
If the cluster has 1 GPU available and user 1 spawns a server with 1 GPU attached, user 2 still sees 1 GPU available in the spawner. Furthermore, if user 2 tries to spawn a server requesting 1 GPU, the spawn stays pending until either the JH timeout (10 minutes) expires or user 1 stops their server and releases the GPU.
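The accounting the dropdown appears to be missing can be sketched as follows. This is only an illustration, not the actual RHODS or JupyterHub code: the dictionary shapes for nodes and pods and the function names are hypothetical, and `nvidia.com/gpu` is assumed to be the resource name advertised on the GPU nodes.

```python
# Hypothetical sketch: compute how many GPUs a new notebook server could request.
# Input shapes are assumptions made for illustration, not a real API response.

GPU_RESOURCE = "nvidia.com/gpu"  # assumed extended resource name on the GPU nodes


def free_gpus_per_node(nodes, pods):
    """Return {node_name: free GPU count} after subtracting GPUs held by pods."""
    free = {n["name"]: int(n["allocatable"].get(GPU_RESOURCE, 0)) for n in nodes}
    for pod in pods:
        # Finished pods no longer hold their GPUs.
        if pod.get("phase") in ("Succeeded", "Failed"):
            continue
        node = pod.get("node")
        # Pods that are not bound to a known node do not consume anything yet.
        if node not in free:
            continue
        free[node] -= int(pod.get("gpu_request", 0))
    # Clamp at zero so inconsistent input never yields a negative count.
    return {name: max(count, 0) for name, count in free.items()}


def max_requestable_gpus(nodes, pods):
    """A single pod runs on one node, so the limit is the best single node."""
    return max(free_gpus_per_node(nodes, pods).values(), default=0)


nodes = [{"name": "gpu-node-1", "allocatable": {GPU_RESOURCE: "1"}}]
user1_pod = {"node": "gpu-node-1", "phase": "Running", "gpu_request": 1}

before = max_requestable_gpus(nodes, [])          # no server running yet
after = max_requestable_gpus(nodes, [user1_pod])  # user 1 holds the only GPU
```

With the scenario from this report, `before` is 1 and `after` is 0, which is the value user 2 would be expected to see.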
Prerequisites (if any, like setup, operators/versions):
RHODS 1.7.0-5 on OSD running OCP 4.10; GPU operator installed, at least 1 GPU node provisioned on cluster
Steps to Reproduce
- log in as user 1
- spawn notebook server with 1 GPU attached
- log out without closing the server
- log in as user 2
- try to spawn notebook server requesting 1 GPU
Actual results:
User 2 can request 1 GPU, but the server is not spawned because no GPU is available. Once the 10-minute timeout passes, the spawning process fails.
Expected results:
User 2 should not see any GPUs available if the GPU is already attached to user 1's server.
When spawning the server, the user should not be stuck waiting for 10 minutes.
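One way to picture the expected behavior is a spawner-side check that limits the dropdown to what is currently free and rejects an over-sized request immediately instead of letting it wait for the timeout. The names below (`gpu_dropdown_options`, `check_gpu_request`, `NoGpuAvailableError`) are hypothetical and are not taken from the JH spawner code; `max_free` stands for the number of GPUs currently free for a single server.

```python
# Hypothetical sketch of the expected spawner behavior; not the actual JH code.


class NoGpuAvailableError(Exception):
    """Raised when a request asks for more GPUs than are currently free."""


def gpu_dropdown_options(max_free):
    """Offer 0..max_free GPUs, so a fully used cluster offers only 0."""
    return list(range(0, max_free + 1))


def check_gpu_request(requested, max_free):
    """Fail fast rather than leaving the spawn pending until the timeout."""
    if requested > max_free:
        raise NoGpuAvailableError(
            f"Requested {requested} GPU(s) but only {max_free} are free."
        )
    return requested


options_for_user2 = gpu_dropdown_options(0)  # user 1 already holds the only GPU
```

In the scenario from this report, `options_for_user2` contains only 0, and calling `check_gpu_request(1, 0)` raises `NoGpuAvailableError` right away.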
Reproducibility (Always/Intermittent/Only Once):
Always
Build Details:
RHODS 1.7.0-5 on OCP 4.10 rc7
Workaround:
No real workaround. User 1 can unblock user 2 by stopping their own server, which frees up the GPU.