Bug
Resolution: Done
Priority: Critical
Release Note Type: Known Issue
Affects Versions: RHODS 1.30, RHODS 1.31
Description of problem:
Jupyter notebooks deployed by RHODS have their shared memory (/dev/shm) limited to 64 MB, and there does not appear to be any way to change this default. This creates issues for PyTorch when running multiple DataLoader workers on GPU-enabled nodes, because the workers hand batches to the training process through /dev/shm. Multiple workers can speed up training by an order of magnitude and make much better use of the GPUs, but the 64 MB limit forces the user to disable them.
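The sketch below illustrates why the limit bites; it is not taken from the report, and the RandomImages dataset and tensor sizes are made up for illustration. A few workers prefetching image-sized batches through shared memory can exceed 64 MB almost immediately.

import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Synthetic stand-in for an image dataset (hypothetical)."""
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        # One 3x640x640 float32 sample is ~4.7 MiB, so a single 16-image
        # batch (~75 MiB) already exceeds a 64 MB /dev/shm.
        return torch.rand(3, 640, 640)

if __name__ == "__main__":
    # Worker processes place prefetched batches in shared memory (/dev/shm)
    # before the main process consumes them.
    loader = DataLoader(RandomImages(), batch_size=16, num_workers=8)
    for batch in loader:
        # With a 64 MB /dev/shm this typically aborts with a shared-memory
        # or "no space left on device" error instead of completing.
        pass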
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce:
- Deploy a Jupyter notebook.
- Clone a repo that uses PyTorch; in my case https://github.com/ultralytics/yolov5
- After running "pip install -r requirements.txt", run "python train.py" with no arguments.
Actual results:
Jupyter will report "no space left on device" and you will notice that /dev/shm is full.
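A quick way to confirm that /dev/shm is the filesystem that filled up is to check it from inside the notebook, for example with the sketch below (running df -h /dev/shm in a terminal gives the same information):

import shutil

# Per the report, the total on an affected notebook is ~64 MiB, and 'free'
# drops to zero while the DataLoader workers are running.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, "
      f"{used / 2**20:.0f} MiB used, {free / 2**20:.0f} MiB free")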
Expected results:
The training script should begin running 100 epochs on the training data without error.
Reproducibility (Always/Intermittent/Only Once):
Always
Build Details:
Workaround:
Run the training with DataLoader workers disabled.
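In PyTorch terms the workaround amounts to setting num_workers=0, so batches are loaded in the main process and no shared-memory segments are created. A minimal sketch, with a placeholder dataset that is not from the report:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.rand(64, 3, 64, 64))  # placeholder data
# num_workers=0 loads every batch in the main process, so /dev/shm is never
# used for inter-process tensor transfer; data loading is correspondingly slower.
loader = DataLoader(data, batch_size=16, num_workers=0)
for (batch,) in loader:
    pass

For the yolov5 reproducer above, the equivalent is passing a worker count of 0 to train.py, if the script exposes such an option.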