Red Hat OpenShift Data Science · RHODS-8939

Shared memory for notebooks set to 64Mb

Details

      == Default shared memory for Jupyter notebook might cause a runtime error

      The default shared memory for a Jupyter notebook is set to 64 MB, and you cannot change this default value in the notebook configuration. For example, PyTorch relies on shared memory, and the default size of 64 MB is not enough for large use cases, such as training a model or performing heavy data manipulations. Jupyter reports a "no space left on device" message, and `/dev/shm` is full.
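
To confirm the limit from inside a running notebook, you can inspect the `/dev/shm` mount with the Python standard library (the `shm_size_mb` helper below is only an illustration, not part of RHODS):

```python
import os

def shm_size_mb(path="/dev/shm"):
    """Return the total size of a mounted filesystem in megabytes."""
    st = os.statvfs(path)  # filesystem statistics for the mount
    return st.f_blocks * st.f_frsize / (1024 * 1024)

# On a default notebook pod this reports roughly 64 MB.
print(f"/dev/shm size: {shm_size_mb():.0f} MB")
```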

      *Workaround*

      . In your data science project, create a workbench as described in [Creating a project workbench].
      . In the data science project page, in the *Workbenches* section, click the *Status* toggle for the workbench to change it from *Running* to *Stopped*.
      . Open your OpenShift Console and then select *Administrator*.
      . Select *Home* -> *API Explorer*.
      . In the *Filter by kind* field, type *notebook*.
      . Select the *kubeflow v1* notebook.
      . Select the *Instances* tab and then select the instance for the workbench that you created in Step 1.
      . Click the *YAML* tab and then select *Actions* -> *Edit Notebook*.
      . Edit the YAML file to add the following information to the configuration:

      ** For the container that has the name of your workbench notebook, add the following lines to the `volumeMounts` section:
      +
      ----
      - mountPath: /dev/shm
        name: shm
      ----
      +
      For example, if your workbench name is `myworkbench`, update the YAML file as follows:
      +
      ----
      spec:
        containers:
          - env:
              ...
            name: myworkbench
            ...
            volumeMounts:
              - mountPath: /dev/shm
                name: shm
      ----

      ** In the `volumes` section, add the lines shown in the following example:
      +
      ----
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
      ----
      +
      *Note:* Optionally, you can specify a limit on the amount of memory that the `emptyDir` volume can use.
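      +
      For example, to cap the shared memory you can use the standard Kubernetes `sizeLimit` field for `emptyDir` volumes (the `2Gi` value below is only illustrative; choose a size that fits your node's memory):
      +
      ----
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
      ----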

      . Click *Save*.

      . In the data science dashboard, in the *Workbenches* section of the data science project, click the *Status* toggle for the workbench. The status changes from *Stopped* to *Starting* and then *Running*.

      . Restart the notebook.

      [WARNING]
      ====
      If you later edit the notebook's configuration through the Data Science dashboard UI, your workaround edit to the notebook configuration will be erased.
      ====
    • Known Issue
    • RHODS 1.30, RHODS 1.31
    • High

    Description

      Description of problem:

      Jupyter notebooks deployed by RHODS have shared memory (/dev/shm) set to 64 MB, and there doesn't appear to be any way to change this default. This causes problems with PyTorch when running multiple workers on GPU-enabled nodes. Multiple workers significantly speed up training tasks, but the 64 MB limit forces the user to disable them. With multiple workers, training can be an order of magnitude faster and utilizes GPUs more efficiently.
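
The pressure on /dev/shm comes from POSIX shared memory, which PyTorch DataLoader worker processes use to hand tensors back to the parent process. A minimal standard-library sketch of the same mechanism (not PyTorch's actual code):

```python
from multiprocessing import shared_memory

# Allocate a 1 MiB block of POSIX shared memory. On Linux it is backed
# by a file under /dev/shm, so every allocation counts against the
# tmpfs size -- with a 64 MB mount, buffers like the ones DataLoader
# workers create can exhaust the space quickly.
shm = shared_memory.SharedMemory(create=True, size=1024 * 1024)
try:
    shm.buf[:5] = b"hello"      # visible to any process that attaches
    data = bytes(shm.buf[:5])
finally:
    shm.close()
    shm.unlink()                # delete the /dev/shm backing file
print(data)                     # prints b'hello'
```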

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      1. Deploy a Jupyter notebook.
      2. Clone a repo that uses PyTorch (in this case, https://github.com/ultralytics/yolov5).
      3. After running `pip install -r requirements.txt`, run `python train.py` with no arguments.

      Actual results:

      Jupyter will report "no space left on device" and you will notice that /dev/shm is full.

      Expected results:

      The training script should begin running 100 epochs on the training data without error.

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      Workaround:

      Run with no workers

      Additional info:

      People

        hnalla Harshad Reddy Nalla
        dawhite20910 David White (Inactive)
        Luca Giorgi