OpenShift Bugs / OCPBUGS-8781

Presto Coordinator flapping when at scale and load

    • Quality / Stability / Reliability
    • Moderate
    • x86_64

      Description of problem:
      On an AWS cluster with 3 masters, 3 infrastructure nodes (for Prometheus etc.) and 100+ worker nodes, the presto-coordinator pod regularly crashes and restarts. This started once load was put on the cluster via mastervertical 500 (mastervertical is a cluster-density test that creates pods, builds, secrets, routes, etc. across the cluster).

      Presto then continued to flap once the load was removed from the system, albeit less frequently.

      Version-Release number of selected component (if applicable):

      # oc version
      Client Version: 4.4.6
      Server Version: 4.4.6
      Kubernetes Version: v1.17.1+f63db30

      Presto image - quay.io/openshift/origin-metering-presto:4.6

      How reproducible:
      100%

      Steps to Reproduce:
      1. Scale up cluster
      2. Add load to system via mastervertical
      3. Watch as the presto-coordinator pod restarts periodically
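
      For step 3, the restarts can also be read from the pod status instead of watching by hand. This is a minimal sketch, assuming the pod JSON comes from something like `oc get pod <pod-name> -o json`. The field names (`containerStatuses`, `restartCount`, `lastState.terminated.exitCode`) are standard Kubernetes pod status fields. The sample data and the container name `presto` are made up for illustration and were not captured from the affected cluster.

```python
import json


def summarize_restarts(pod):
    """Return (container, restart_count, last_exit_code) tuples from pod JSON."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present if the container has exited before.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        rows.append((cs["name"], cs.get("restartCount", 0), terminated.get("exitCode")))
    return rows


# Illustrative sample only (hypothetical values), shaped like `oc get pod -o json`.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "presto",
        "restartCount": 7,
        "lastState": {"terminated": {"exitCode": 3, "reason": "Error"}}
      }
    ]
  }
}
""")

if __name__ == "__main__":
    for name, restarts, exit_code in summarize_restarts(SAMPLE_POD):
        print(f"{name}: restarts={restarts} last_exit_code={exit_code}")
```

      A rising restart count together with a repeated last exit code would match the flapping described here.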

      Actual results:
      The Presto pod crashes with exit code 3. The Presto logs show:

      ...
      java.lang.OutOfMemoryError: Java heap space
      Dumping heap to /var/presto/logs/heap_dump.bin ...
      Unable to create /var/presto/logs/heap_dump.bin: File exists
      Terminating due to java.lang.OutOfMemoryError: Java heap space
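
      The excerpt above can be checked for mechanically when going through longer logs. This is a minimal sketch that scans log lines for the two signatures seen here: the OutOfMemoryError kind and a failed heap dump. The function name and regular expressions are illustrative assumptions, not part of Presto or the metering operator. The sample lines are the ones quoted in this report.

```python
import re

# Patterns modelled on the log lines quoted in this report.
OOM_RE = re.compile(r"java\.lang\.OutOfMemoryError: (?P<kind>.+)$")
DUMP_FAIL_RE = re.compile(r"Unable to create (?P<path>\S+): (?P<reason>.+)$")


def scan_presto_log(lines):
    """Collect distinct OutOfMemoryError kinds and failed heap dump attempts."""
    oom_kinds = set()
    dump_failures = []
    for raw in lines:
        line = raw.strip()
        match = DUMP_FAIL_RE.search(line)
        if match:
            dump_failures.append((match["path"], match["reason"]))
            continue
        match = OOM_RE.search(line)
        if match:
            oom_kinds.add(match["kind"])
    return {"oom_kinds": sorted(oom_kinds), "dump_failures": dump_failures}


# Log lines as quoted in this report.
EXCERPT = [
    "java.lang.OutOfMemoryError: Java heap space",
    "Dumping heap to /var/presto/logs/heap_dump.bin ...",
    "Unable to create /var/presto/logs/heap_dump.bin: File exists",
    "Terminating due to java.lang.OutOfMemoryError: Java heap space",
]

if __name__ == "__main__":
    print(scan_presto_log(EXCERPT))
```

      The "File exists" message suggests that a heap dump from an earlier crash is still at that path, so later crashes do not produce a fresh dump.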

      Full error log attached

      Expected results:
      Presto pod should remain stable

      Additional info:
      This does not seem to cause running reports to fail. I have been running reports when the pod crashes; the crash delays the report but does not cause report generation to fail.

      Metering was installed following the GitHub page, using ./hack/openshift-install

      Attached is the log file from the Presto container after the last crash.

              btofelrh Brett Tofel
              rhn-support-rzaleski Russell Zaleski
              Peter Ruan
              Red Hat Employee