Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: RHODS_1.1_GA
Affects Version/s: RHODS_1.1_GA
Component/s: Install Upgrade Uninstall, Workbenches
Labels:
- Eng
- Groomed

Story Points:
2
Blocked:
False
Ready:
False
Acceptance Criteria:
- If one exists, I would like a link to a user document that states the minimum cluster requirements
- If my current cluster size is supported, a fix to the CUDA build process so that it succeeds without manual intervention
Automated:
No
CDW blocker:
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Regression:
No
Release Note Text:
Undefined
Target Release:

RHODS_1.1_GA
Test Blocker:
Yes
Test Coverage:

Yes
Watchlist Impact:
None
Market:

Sprint:
MODH Sprint 27, MODH Sprint 28, MODH Sprint 29

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Note: this bug is probably the same as ~~RHODS-1526~~, but that one is in a PSI cluster and this one is in an OpenShift Dedicated cluster.

Description of problem:

We have a cluster in OpenShift Dedicated with:

3 master nodes (m5.xlarge: 4 CPU, 16 GB RAM)
2 infra,worker nodes (r5.xlarge: 4 CPU, 32 GB RAM)
3 worker nodes (m5.xlarge: 4 CPU, 16 GB RAM)

When we install RHODS using the add-on in OpenShift Cluster Manager, the CUDA builds cannot finish because of insufficient CPU (see attached screenshots).
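
To illustrate why a build pod can stay unscheduled on 4-CPU nodes, here is a minimal sketch of the CPU check the scheduler applies: a pod fits on a node only if its CPU request is no larger than the node's allocatable CPU minus the requests of the pods already placed there. The allocatable value and the existing requests below are hypothetical numbers chosen for the example, not measurements from this cluster, and the helper names `parse_cpu` and `fits` are mine.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('2', '0.5', '500m') to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(allocatable: str, already_requested: list[str], new_request: str) -> bool:
    """Return True if new_request fits in the CPU not yet requested on the node."""
    free = parse_cpu(allocatable) - sum(parse_cpu(q) for q in already_requested)
    return parse_cpu(new_request) <= free


# Hypothetical node: 4 CPUs nominal, 3500m allocatable, 2000m already requested.
node_allocatable = "3500m"
existing_requests = ["1", "500m", "500m"]

print(fits(node_allocatable, existing_requests, "2"))  # build pod requesting 2 CPUs
print(fits(node_allocatable, existing_requests, "1"))  # build pod requesting 1 CPU
```

With these made-up numbers only 1500m is left on the node, so a request of 2 CPUs does not fit while a request of 1 CPU does.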

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

1. Log in to https://qaprodauth.cloud.redhat.com
2. Go to OpenShift Cluster Manager
3. Click into an existing cluster
4. Go to Add-ons
5. Install RHODS
6. While it installs, log in to the cluster as cluster-admin and go to Workloads > Pods
7. Select the project redhat-ods-applications
8. Check the status of the 11.0.3-cuda-* pods

Some time after the installation starts, these pods should be Running or Completed, but usually they are Pending because of "Pod unschedulable: 2 Insufficient cpu ..." (see attached screenshot).
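
The same check can be done from the command line instead of the console. The sketch below assumes the pod list has been exported with `oc get pods -n redhat-ods-applications -o json` and filters it for pods that are Pending with a `PodScheduled` condition whose reason is `Unschedulable`. The function name `unschedulable_pods`, the sample pod names, and the sample message are hypothetical; only the field names come from the Kubernetes Pod status schema.

```python
import json
import sys


def unschedulable_pods(pod_list: dict, name_prefix: str = "") -> list[tuple[str, str]]:
    """Return (pod name, scheduler message) for Pending pods marked Unschedulable."""
    results = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        if not name.startswith(name_prefix) or status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if (
                cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"
            ):
                results.append((name, cond.get("message", "")))
    return results


# Hypothetical sample data, shaped like `oc get pods -o json` output.
sample_pod_list = {
    "items": [
        {
            "metadata": {"name": "11.0.3-cuda-example-1-build"},
            "status": {
                "phase": "Pending",
                "conditions": [
                    {
                        "type": "PodScheduled",
                        "status": "False",
                        "reason": "Unschedulable",
                        "message": "illustrative: Insufficient cpu",
                    }
                ],
            },
        },
        {"metadata": {"name": "other-pod"}, "status": {"phase": "Running"}},
    ]
}

if __name__ == "__main__":
    # Read real data from stdin when piped, otherwise fall back to the sample.
    data = sample_pod_list if sys.stdin.isatty() else json.load(sys.stdin)
    for pod_name, message in unschedulable_pods(data, "11.0.3-cuda-"):
        print(f"{pod_name}: {message}")
```

Saved under a name of your choice, it could be used as `oc get pods -n redhat-ods-applications -o json | python <script>`.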

Sometimes, to make the pod start, I scale the Grafana pods down from 2 to 0. Freeing these 2 slots lets the CUDA build start and finish successfully.

For the last 3 CUDA builds (s2i-minimal-gpu-cuda, s2i-pytorch-cuda and s2i-tensorflow-gpu) this is not enough because they request even more CPU. To make those builds start, I manually edit their BuildConfig to reduce the CPU request from 2 to 1 and then start a new build:

Original BuildConfig:

  resources:
    limits:
      cpu: '4'
      memory: 8Gi
    requests:
      cpu: '2'
      memory: 6Gi

Modified BuildConfig:

  resources:
    limits:
      cpu: '4'
      memory: 8Gi
    requests:
      cpu: '1'
      memory: 6Gi
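
The manual edit above could also be scripted. This is a minimal sketch, assuming the `resources` block shown sits under `spec` of the BuildConfig and that the BuildConfig objects are named exactly like the builds listed earlier (the real object names in the cluster may differ). It only builds the JSON merge patch and the `oc patch` / `oc start-build` command lines and prints them; it does not run anything against a cluster. The helper names are hypothetical.

```python
import json
import shlex


def cpu_request_patch(cpu: str) -> str:
    """Build a JSON merge patch that changes only spec.resources.requests.cpu."""
    patch = {"spec": {"resources": {"requests": {"cpu": cpu}}}}
    return json.dumps(patch)


def oc_commands(build_config: str, namespace: str, cpu: str) -> list[list[str]]:
    """Return the argv lists to patch a BuildConfig and then start a new build."""
    return [
        [
            "oc", "patch", f"buildconfig/{build_config}",
            "-n", namespace,
            "--type=merge",
            "-p", cpu_request_patch(cpu),
        ],
        ["oc", "start-build", build_config, "-n", namespace],
    ]


# BuildConfig names are assumed to match the build names given in this report.
for name in ["s2i-minimal-gpu-cuda", "s2i-pytorch-cuda", "s2i-tensorflow-gpu"]:
    for argv in oc_commands(name, "redhat-ods-applications", "1"):
        print(shlex.join(argv))
```

A merge patch leaves the limits and the memory request untouched, which matches the manual change shown in the modified BuildConfig.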

I haven't found any section in "Getting started with Red Hat OpenShift Data Science" that states the minimum cluster size required to install and run RHODS; I think it should be specified there.

Also, I think the CUDA builds should be optimized to reduce the amount of CPU they request simultaneously, which would avoid this problem.

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details: RHODS 1.0.17 in OpenShift Dedicated. It also happened in previous versions

Additional info:


cuda-build-pending-pod-unschedulable.png (2021/07/21 3:12 PM, 143 kB, Jorge Garcia Oncins)
cuda-build-pending-pod-unschedulable-cluster-overview.png (2021/07/21 3:14 PM, 189 kB, Jorge Garcia Oncins)
cuda-build-pending-pod-unschedulable-nodes.png (2021/07/21 3:14 PM, 106 kB, Jorge Garcia Oncins)
jupyterhub-pending-unscheduable.png (2021/07/21 3:14 PM, 172 kB, Jorge Garcia Oncins)
rhods-add-on-installation-cant-start-underpowered-cluster.png (2021/09/29 11:21 AM, 230 kB, Jorge Garcia Oncins)
rhods-cluster-too-small.png (2021/08/26 8:11 PM, 33 kB, Trevor Mckay)
s2i-minimal-unschedulable-2021-07-28.png (2021/07/28 11:37 AM, 176 kB, Jorge Garcia Oncins)
s2i-minimap-gpu-cuda-build-pending-pod-unschedulable.png (2021/07/21 3:14 PM, 221 kB, Jorge Garcia Oncins)
s2i-minimap-gpu-cuda-build-pending-pod-unschedulable-info01.png (2021/07/21 3:14 PM, 85 kB, Jorge Garcia Oncins)
s2i-minimap-gpu-cuda-build-pending-pod-unschedulable-info02.png (2021/07/21 3:14 PM, 192 kB, Jorge Garcia Oncins)

is related to

RHODS-1526 PyTorch and Tensorflow build pods timing out

Closed

RHODS-1548 Cuda build chain is broken in v1.0.17

Closed

mentioned on

Merge request - RHODS-1536 Added requirements for installing RHODS on OpenShift Dedicated

There are no Sub-Tasks for this issue.

Assignee: Trevor Mckay (Inactive)

Reporter: Jorge Garcia Oncins

QA Contact: Pablo Felix (Inactive)

Votes: 0

Watchers: 11

Created: 2021/07/21 3:01 PM

Updated: 2023/02/17 9:28 PM

Resolved: 2021/09/07 5:45 PM
