  1. Red Hat OpenShift Data Science
  2. RHODS-8152

When the cluster is full, the RHODS update does not complete and leaves the environment "stuck"


    • RHODS 1.27, RHODS 1.28

      Description of problem:

      I was contacted by a customer whose RHODS environment was no longer spawning notebooks successfully.

      Upon investigation, it was found that many pods in the namespace redhat-ods-applications were in a Pending state, and had been for a few hours. The Notebook Controller pods (or at least some of them) were among the pending pods.
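
      A check along the following lines could surface this condition earlier. It is a minimal sketch, not part of RHODS: it assumes the output of `oc get pods -n redhat-ods-applications -o json` has already been parsed into a dict, and it approximates how long a pod has been pending from its creation timestamp. The function name, the one-hour threshold, and the sample pod names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def long_pending_pods(pod_list, now, threshold=timedelta(hours=1)):
    """Return names of pods that are Pending and older than `threshold`.

    `pod_list` is the parsed JSON of `oc get pods -o json` (a dict with "items").
    Pod age is approximated from metadata.creationTimestamp.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > threshold:
            stuck.append(pod["metadata"]["name"])
    return stuck


# Illustrative data only: pod names and timestamps are made up.
sample = {
    "items": [
        {
            "metadata": {"name": "example-pod-a",
                         "creationTimestamp": "2023-01-01T08:00:00Z"},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"name": "example-pod-b",
                         "creationTimestamp": "2023-01-01T11:45:00Z"},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"name": "example-pod-c",
                         "creationTimestamp": "2023-01-01T07:00:00Z"},
            "status": {"phase": "Running"},
        },
    ]
}
now = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(long_pending_pods(sample, now))
```

      With the sample data, only the pod that has been pending since 08:00 is reported; the recently created pending pod and the running pod are ignored.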

      This situation was due to a combination of factors:

      • RHODS 1.25 (cloud version) had been released earlier in the day.
      • The pods were trying to roll over (during which the resource requirements increase).
      • The only machine pool where these pods could run was at capacity and not configured for auto-scaling.

      I told the customer to add a couple of machines to the default node pool. The pods then reached a Running state, the update to RHODS 1.25 completed, and users were once again able to spawn notebooks.

      Prerequisites (if any, like setup, operators/versions):

       

      Steps to Reproduce

      1. Deploy RHODS.
      2. Fill up the cluster so that no new pods can be scheduled.
      3. Trigger an upgrade of RHODS.

      Actual results:

      RHODS gets stuck during the upgrade and stops working. 

      Expected results:

      RHODS gets stuck during the upgrade and stops working, but SRE notices, connects to the cluster, adds a couple of nodes so that RHODS can finish its update, and then removes the nodes.

       

      Reproducibility (Always/Intermittent/Only Once):

      I think this is the second time I have seen this.

      Build Details:

      RHODS 1.24->1.25

      Workaround:

      The customer should maintain some "headroom" in their cluster, but we would need to tell them how much headroom is required. At a minimum, the customer should have some autoscaling headroom.
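
      One way to put a number on that headroom is to estimate the extra CPU requests a rolling update can add. The sketch below assumes the pods are managed by Deployments using the RollingUpdate strategy, where up to maxSurge extra pods may exist during the rollout (Kubernetes defaults maxSurge to 25% and rounds a percentage up to a whole pod). The workload names, replica counts, and CPU requests are hypothetical and are not the real RHODS values.

```python
import math


def surge_pods(replicas, max_surge="25%"):
    """Extra pods a Deployment may create during a rolling update.

    `max_surge` is an int or a percentage string. Kubernetes rounds a
    percentage up to a whole pod; 25% is the Deployment default.
    """
    if isinstance(max_surge, str) and max_surge.endswith("%"):
        return math.ceil(replicas * int(max_surge[:-1]) / 100)
    return int(max_surge)


def rollout_headroom_millicores(deployments):
    """Worst-case extra CPU requests if every Deployment surges at once."""
    return sum(
        surge_pods(d["replicas"], d.get("max_surge", "25%")) * d["cpu_request_m"]
        for d in deployments
    )


# Hypothetical workloads: NOT the real RHODS replica counts or requests.
deployments = [
    {"name": "example-controller", "replicas": 1, "cpu_request_m": 500},
    {"name": "example-dashboard", "replicas": 5, "cpu_request_m": 500, "max_surge": 1},
    {"name": "example-api", "replicas": 4, "cpu_request_m": 250},
]
print(rollout_headroom_millicores(deployments))
```

      This is an aggregate estimate for CPU only. Memory requests would need the same treatment, and each surge pod must also fit on a single node, so the real requirement may be higher than the sum suggests.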

      Additional info:

              rh-ee-magautie Max Gautier (Inactive)
              egranger@redhat.com Erwan Granger
              Tarun Kumar Tarun Kumar