  1. OpenShift Bugs
  2. OCPBUGS-16190

Hive Memory Leak in ARO On 0.25.4 to 0.26.1 K8s Dependency Bumps


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • premerge
    • Hive
    • Important
    • No
    • False

      Description of problem:

      ARO runs Hive on AKS. Because of a memory leak in Hive, we have been restricted to running an older version. Working with the Hive team, we narrowed the leak down to Hive commit 33c5cd37cd, where it first appears. In larger regions the leak shows up within an hour, and memory keeps growing until Kubernetes kills the pod for running out of memory (OOM).
      
      The leak is most pronounced in the hibernation controller's reconciliation loop.

      Version-Release number of selected component (if applicable):

      https://github.com/openshift/hive/tree/33c5cd37cd and all later commits are affected

      How reproducible:

      Always. Every Hive version from commit 33c5cd37cd onward is affected on ARO AKS clusters.

      Steps to Reproduce:

      1. Create an AKS cluster.
      2. Deploy an affected Hive version to the AKS cluster.
      3. Run the ARO RP against the Hive cluster.
      4. Create a cluster using the ARO RP.
      5. Replace the cluster's service principal credentials with invalid ones so that the leak progresses faster.
      6. Watch the memory consumption of the hive-controller pod increase over time.
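Step 6 can be made measurable with a small script. The following is a minimal sketch, not part of the original report: it assumes `kubectl` is configured for the AKS cluster and that the metrics API (`kubectl top`) is available there. The function names and the default interval and sample count are illustrative choices, and the pod name and namespace must be supplied by the caller.

```python
import re
import subprocess
import time

# Binary suffixes that `kubectl top` normally reports for memory.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '345Mi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity)
    if match is None:
        raise ValueError(f"unrecognized memory quantity: {quantity!r}")
    value, unit = match.groups()
    return int(value) * _UNITS.get(unit, 1)


def pod_memory_bytes(pod: str, namespace: str) -> int:
    """Read the current memory usage of one pod via `kubectl top`."""
    out = subprocess.run(
        ["kubectl", "top", "pod", pod, "-n", namespace, "--no-headers"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # Columns are: NAME  CPU(cores)  MEMORY(bytes)
    return parse_memory(out.split()[2])


def growth_per_hour(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span a non-zero time range")
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return num / den * 3600


def watch(pod: str, namespace: str, interval_s: float = 60, count: int = 30):
    """Sample pod memory `count` times and return the growth rate per hour."""
    samples = []
    start = time.monotonic()
    for _ in range(count):
        samples.append((time.monotonic() - start, pod_memory_bytes(pod, namespace)))
        time.sleep(interval_s)
    return growth_per_hour(samples)
```

Calling `watch(pod_name, namespace)` with the actual controller pod name and namespace returns an estimated growth rate in bytes per hour. A rate that stays clearly positive across repeated runs is consistent with the leak described here, while a stable deployment should hover near zero.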
      

      Actual results:

      Hive leaks memory and eventually crashes when the pod is OOM-killed.

      Expected results:

      Hive does not leak memory, and its memory usage remains stable.

      Additional info:

      Relevant thread: https://redhat-internal.slack.com/archives/CE3ETN3J8/p1688416480405189
      
      Linked story on ARO side: https://issues.redhat.com/browse/ARO-3639

       

            jstuever@redhat.com Jeremiah Stuever
            bvesel@redhat.com Benjamin Vesel
            Jianping Shu Jianping Shu