Bug
Resolution: Done
Version: 4.4
Impact: Quality / Stability / Reliability
Severity: Moderate
Hardware: x86_64
Description of problem:
On an AWS cluster with 3 masters, 3 infrastructure nodes (for Prometheus, etc.), and 100+ worker nodes, the presto-coordinator pod regularly crashes and restarts. This started once load was put on the cluster via mastervertical 500 (mastervertical is a cluster-density-focused test that creates pods, builds, secrets, routes, etc. across the cluster).
Presto then continued to flap after the load was removed from the system, albeit less frequently.
Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.4.6
Server Version: 4.4.6
Kubernetes Version: v1.17.1+f63db30
Presto image - quay.io/openshift/origin-metering-presto:4.6
How reproducible:
100%
Steps to Reproduce:
1. Scale up cluster
2. Add load to system via mastervertical
3. Watch as the presto-coordinator pod restarts periodically
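To quantify the flapping in step 3, the coordinator pod's restart count and last exit code can be read from its status. A minimal sketch, assuming the pod JSON comes from `oc get pod <presto-coordinator-pod> -o json`; the sample status embedded below is illustrative, not taken from the affected cluster:

```python
import json

# Sample shaped like the `status` block of `oc get pod -o json` for the
# coordinator pod (values here are hypothetical, not from the bug report).
sample = """
{
  "status": {
    "containerStatuses": [
      {
        "name": "presto",
        "restartCount": 7,
        "lastState": {
          "terminated": {"exitCode": 3, "reason": "Error"}
        }
      }
    ]
  }
}
"""

def crash_summary(pod_json: str) -> dict:
    """Return restart count and last terminated exit code per container."""
    status = json.loads(pod_json)["status"]
    out = {}
    for cs in status.get("containerStatuses", []):
        term = (cs.get("lastState") or {}).get("terminated") or {}
        out[cs["name"]] = {
            "restarts": cs.get("restartCount", 0),
            "last_exit_code": term.get("exitCode"),
        }
    return out

print(crash_summary(sample))
```

On a live cluster, piping the real pod JSON through `crash_summary` shows whether `restartCount` keeps climbing and whether the terminations consistently report exit code 3.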
Actual results:
The presto-coordinator pod crashes with exit code 3. The Presto logs show:
...
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/presto/logs/heap_dump.bin ...
Unable to create /var/presto/logs/heap_dump.bin: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space
Full error log attached
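The "File exists" line is the HotSpot JVM declining to overwrite an earlier heap dump: with -XX:+HeapDumpOnOutOfMemoryError pointed at a fixed file, only the first OOM after that path is free actually produces a dump. Pointing -XX:HeapDumpPath at a directory instead makes the JVM write a unique java_pid<pid>.hprof per crash, so later crashes remain diagnosable. A sketch of the relevant jvm.config lines (the heap size shown is illustrative, not the shipped default):

```
-Xmx1G
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/presto/logs
-XX:+ExitOnOutOfMemoryError
```

The "Terminating due to java.lang.OutOfMemoryError" message in the log is consistent with -XX:+ExitOnOutOfMemoryError being set, which is why the container exits (and restarts) rather than limping along after the OOM.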
Expected results:
The presto-coordinator pod should remain stable under load.
Additional info:
The crashes do not appear to break report generation. Reports that are running when the coordinator crashes are delayed, but they still complete successfully.
Metering was installed from the GitHub repository using ./hack/openshift-install.
Attached is the log file from presto container after the last crash.