Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.4
Component/s: Metering Operator
Labels:
- migrated_from_bz

Activity Type: Quality / Stability / Reliability
Blocked: None
Blocked Reason: None
Story Points: None
Severity: Moderate
Regression: None
Architecture: x86_64
Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status: None
Release Note Type: If docs needed, set a value
Release Note Text: None

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:
On an AWS cluster with 3 masters, 3 infrastructure nodes (for Prometheus, etc.), and 100+ worker nodes, the presto-coordinator pod regularly crashes and restarts. This started once load was put on the cluster via mastervertical 500 (mastervertical is a cluster-density test that creates pods, builds, secrets, routes, etc. across the cluster).

Presto continued to flap after the load was removed from the system, albeit less frequently.

Version-Release number of selected component (if applicable):

oc version

Client Version: 4.4.6

Server Version: 4.4.6

Kubernetes Version: v1.17.1+f63db30

Presto image - quay.io/openshift/origin-metering-presto:4.6

How reproducible:
100%

Steps to Reproduce:
1. Scale up cluster
2. Add load to system via mastervertical
3. Watch as the presto-coordinator pod restarts periodically
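
Step 3 can also be checked without watching the pod by hand. The following is a minimal sketch, assuming the pod list for the metering namespace was saved with `oc get pods -o json`. The function name `restart_summary` and the sample data are hypothetical; `restartCount` and `lastState.terminated.exitCode` are standard fields of the Kubernetes pod status.

```python
import json


def restart_summary(pod_list_json, name_prefix="presto-coordinator"):
    """Return (pod, container, restart count, last exit code) per matching container."""
    rows = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        if not pod_name.startswith(name_prefix):
            continue
        for status in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is only present once the container has exited at least once.
            terminated = status.get("lastState", {}).get("terminated") or {}
            rows.append(
                (pod_name, status["name"], status.get("restartCount", 0), terminated.get("exitCode"))
            )
    return rows


# Made-up sample shaped like `oc get pods -o json` output; not data from this cluster.
sample = json.dumps({
    "items": [{
        "metadata": {"name": "presto-coordinator-0"},
        "status": {"containerStatuses": [{
            "name": "presto",
            "restartCount": 4,
            "lastState": {"terminated": {"exitCode": 3, "reason": "Error"}},
        }]},
    }]
})

for row in restart_summary(sample):
    print(row)
```

Running this against snapshots taken a few minutes apart would show whether the restart count is still climbing after the load is removed.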

Actual results:
The Presto pod crashes with error code 3. The Presto logs show:

...
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/presto/logs/heap_dump.bin ...
Unable to create /var/presto/logs/heap_dump.bin: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space

Full error log attached
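
The excerpt above has two things worth extracting from the full log: the kind of OutOfMemoryError, and the heap dump that could not be written. The "File exists" message suggests that a dump from an earlier crash was still at that path, so no new dump was captured for this crash. Below is a minimal sketch for pulling both out of a saved log; the function name `summarize_oom` is hypothetical, and the patterns are written only against the lines shown in this report, not against every message the JVM can emit.

```python
import re

# Patterns modelled on the log lines quoted in this report (an assumption,
# not an exhaustive list of JVM out-of-memory messages).
OOM_LINE = re.compile(r"java\.lang\.OutOfMemoryError: (?P<kind>.+)$")
DUMP_FAILED = re.compile(r"^Unable to create (?P<path>\S+): (?P<reason>.+)$")


def summarize_oom(log_text):
    """Collect distinct OutOfMemoryError kinds and heap dumps that failed to write."""
    kinds = set()
    failed_dumps = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        dump = DUMP_FAILED.match(line)
        if dump:
            failed_dumps.append((dump["path"], dump["reason"]))
            continue
        oom = OOM_LINE.search(line)
        if oom:
            kinds.add(oom["kind"])
    return {"oom_kinds": sorted(kinds), "failed_heap_dumps": failed_dumps}


# The excerpt from this report, used as input.
excerpt = """\
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/presto/logs/heap_dump.bin ...
Unable to create /var/presto/logs/heap_dump.bin: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space
"""

print(summarize_oom(excerpt))
```

If the failed-dump list is non-empty, the file already at that path belongs to an earlier crash and would need to be copied out or removed before a fresh dump can be written.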

Expected results:
Presto pod should remain stable

Additional info:
This does not seem to cause running reports to fail. I have been running reports when the pod crashes; the crash delays the report but does not cause report generation to fail.

Metering was installed from the GitHub page, using ./hack/openshift-install

Attached is the log file from the presto container after the last crash.

Assignee: Brett Tofel

Reporter: Russell Zaleski

QA Contact: Peter Ruan

Contributing Groups: Red Hat Employee

Need Info From: None

Votes: 0

Watchers: 8

Created: 2020/06/18 6:24 PM

Updated: 2025/07/27 11:29 PM

Resolved: 2023/03/09 2:13 PM
