  1. Red Hat OpenShift Data Science
  2. RHODS-1536

Unable to finish CUDA builds in OpenShift Dedicated due to unschedulable pods


    • 2
    • False
    • False
      • If one exists, a link to user documentation that states the minimum cluster requirements
      • If my current cluster size is supported, a fix to the CUDA build process so that it succeeds without manual intervention
    • No
    • No
    • Undefined
    • Yes
    • Yes
    • None
    • MODH Sprint 27, MODH Sprint 28, MODH Sprint 29

      Note: this bug is probably the same as RHODS-1526, but that one is in a PSI cluster and this one is in an OpenShift Dedicated cluster.

      Description of problem:

      We have a cluster in OpenShift Dedicated with:

      • 3 master nodes (m5.xlarge: 4 CPU, 16 GB RAM)
      • 2 infra,worker nodes (r5.xlarge: 4 CPU, 32 GB RAM)
      • 3 worker nodes (m5.xlarge: 4 CPU, 16 GB RAM)

      When we install RHODS using the add-on in OpenShift Cluster Manager, the CUDA builds cannot finish because of insufficient CPU (see attached screenshots).
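
The "Insufficient cpu" message reflects how Kubernetes schedules by CPU requests: a pod is placed only on a single node whose unreserved CPU covers the pod's whole request. The sketch below illustrates that check in Python. Only the 4-CPU node size and the 2-CPU and 1-CPU request values come from this report; the node names and the per-node "already requested" figures are hypothetical, and real allocatable CPU is somewhat lower than the raw node size.

```python
# Sketch of a per-node CPU fit check, in millicores (1000m = 1 CPU).
# Node names and "already requested" figures are hypothetical.

def fits(node_cpu_m, already_requested_m, pod_request_m):
    """Return True if the pod's CPU request fits in the node's unreserved CPU."""
    return node_cpu_m - already_requested_m >= pod_request_m

def schedulable_nodes(nodes, pod_request_m):
    """Return the names of nodes that could accept the pod, in input order."""
    return [
        name
        for name, (cpu_m, requested_m) in nodes.items()
        if fits(cpu_m, requested_m, pod_request_m)
    ]

# Five worker-capable nodes with 4 CPU (4000m) each, as in the report.
# The second value in each pair is a made-up sum of existing CPU requests.
nodes = {
    "infra-worker-1": (4000, 3200),
    "infra-worker-2": (4000, 3400),
    "worker-1": (4000, 2900),
    "worker-2": (4000, 3100),
    "worker-3": (4000, 2500),
}

print(schedulable_nodes(nodes, 2000))  # request of 2 CPU
print(schedulable_nodes(nodes, 1000))  # request of 1 CPU
```

With these made-up numbers the cluster has 4900m of unreserved CPU in total, yet a 2-CPU request fits nowhere because no single node has 2000m free, while a 1-CPU request fits on two nodes.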

       

       

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      1. Log in to https://qaprodauth.cloud.redhat.com
      2. Go to OpenShift Cluster Manager
      3. Click into an existing cluster
      4. Open Add-ons
      5. Install RHODS
      6. While it installs, log in to the cluster as cluster-admin and go to Workloads > Pods
      7. Select project: redhat-ods-applications
      8. Check the status of the pods named 11.0.3-cuda-*
      9. Some time after the installation starts, these pods should be Running or Completed, but they are usually Pending because of "Pod unschedulable: 2 Insufficient cpu ..." (see attached screenshot)

       

      Sometimes, to get the pod to start, I scale the Grafana pods down from 2 to 0. Freeing these 2 slots lets the CUDA build start and finish successfully.

       

      For the last 3 CUDA builds (s2i-minimal-gpu-cuda, s2i-pytorch-cuda and s2i-tensorflow-gpu) this is not enough because they require even more CPUs. To get those builds to start, I manually modify their BuildConfig to reduce the number of requested CPUs and then start a new build:

      Original BuildConfig:

        resources:
          limits:
            cpu: '4'
            memory: 8Gi
          requests:
            cpu: '2'
            memory: 6Gi
      

      Modified BuildConfig:

        resources:
          limits:
            cpu: '4'
            memory: 8Gi
          requests:
            cpu: '1'
            memory: 6Gi
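
The manual change above could also be applied from the command line. The following is a minimal sketch, not a tested procedure: it only generates `oc patch` and `oc start-build` command strings. It assumes that the resources fragment sits at `spec.resources` in each BuildConfig, that the BuildConfig names match the build names listed in this report, and that they live in the redhat-ods-applications project.

```python
import json
import shlex

# Assumptions: project name and BuildConfig names are taken from the report
# text; they have not been verified against a live cluster.
NAMESPACE = "redhat-ods-applications"
BUILD_CONFIGS = ["s2i-minimal-gpu-cuda", "s2i-pytorch-cuda", "s2i-tensorflow-gpu"]

def cpu_request_patch(cpu):
    """Build a merge patch that changes only the CPU request."""
    return {"spec": {"resources": {"requests": {"cpu": str(cpu)}}}}

def patch_commands(name, cpu, namespace=NAMESPACE):
    """Return the shell commands that patch one BuildConfig and rebuild it."""
    patch = json.dumps(cpu_request_patch(cpu), separators=(",", ":"))
    return [
        f"oc patch bc/{name} -n {namespace} --type=merge -p {shlex.quote(patch)}",
        f"oc start-build {name} -n {namespace}",
    ]

# Print the commands instead of running them, so they can be reviewed first.
for name in BUILD_CONFIGS:
    for command in patch_commands(name, 1):
        print(command)
```

A merge patch leaves the limits and the memory request untouched, which matches the difference between the original and modified BuildConfig shown above.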
      

       

      In Getting started with Red Hat OpenShift Data Science I have not found any section that states the minimum cluster size required to install and run RHODS. I think it should be specified there.

       

      Also, I think the CUDA builds should be optimized to reduce the number of CPUs they request at the same time, to avoid this problem.

       

      Actual results:

       

      Expected results:

      Reproducibility (Always/Intermittent/Only Once):

      Build Details: RHODS 1.0.17 in OpenShift Dedicated. It also happened in previous versions

      Additional info:

        1. cuda-build-pending-pod-unschedulable.png
          143 kB
          Jorge Garcia Oncins
        2. cuda-build-pending-pod-unschedulable-cluster-overview.png
          189 kB
          Jorge Garcia Oncins
        3. cuda-build-pending-pod-unschedulable-nodes.png
          106 kB
          Jorge Garcia Oncins
        4. jupyterhub-pending-unscheduable.png
          172 kB
          Jorge Garcia Oncins
        5. rhods-add-on-installation-cant-start-underpowered-cluster.png
          230 kB
          Jorge Garcia Oncins
        6. rhods-cluster-too-small.png
          33 kB
          Trevor Mckay
        7. s2i-minimal-unschedulable-2021-07-28.png
          176 kB
          Jorge Garcia Oncins
        8. s2i-minimap-gpu-cuda-build-pending-pod-unschedulable.png
          221 kB
          Jorge Garcia Oncins
        9. s2i-minimap-gpu-cuda-build-pending-pod-unschedulable-info01.png
          85 kB
          Jorge Garcia Oncins
        10. s2i-minimap-gpu-cuda-build-pending-pod-unschedulable-info02.png
          192 kB
          Jorge Garcia Oncins


              tmckay@redhat.com Trevor Mckay (Inactive)
              rhn-support-jgarciao Jorge Garcia Oncins
              Pablo Felix Pablo Felix (Inactive)