Type: Enhancement
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: EAP64 1.8.4.GA, EAPCD 13.0.GA, EAP71 1.3.1.GA, EAP72 7.2.1-OpenJDK 11.GA
Component/s: Common, EAP7, EAP_CD
Labels:
- 7.2.x-openjdk11

CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Target Release: EAP72 7.2.3-OpenJDK 11.GA
Git Pull Request: https://github.com/jboss-container-images/jboss-eap-modules/pull/115

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Per the k8s docs[1], the number of times a probe is retried before it is treated as failed can be configured in the probe config provided to k8s. In our case that's set in the application template livenessProbe/readinessProbe config section, which ultimately configures k8s to call our livenessProbe.sh and readinessProbe.sh.

Further, those docs indicate that by default (timeoutSeconds: 1) a probe must complete within 1 sec, otherwise it is considered failed. The timeout can be set to a higher value, but again the templates would need to set that.

Per the bug report at [2] it seems k8s is not properly enforcing the timeout, but that could change at any time, so we should work to ensure our probes do not start failing if OpenShift moves to a k8s release with this fixed.

The retry and timeout issues are related because one reason our probes might take a long time to complete is that they currently attempt to do retries internally.

1) The scripts in the os-eap-probes module check for COUNT and SLEEP args to the script (which would be set in the application template livenessProbe/readinessProbe config section) and default to 30 and 5 respectively. That means in case of failure, the retry will take longer than 1 sec, so once the issue at [2] is fixed the retries will no longer be meaningful.
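To make the timing concrete, here is a rough sketch of the worst-case arithmetic. It assumes the script sleeps SLEEP seconds between attempts and not after the last one, and it ignores the time each check itself takes; the real scripts may differ. The function name worst_case_sleep is hypothetical and not part of the os-eap-probes module.

```shell
#!/bin/sh
# Hypothetical helper: lower bound, in seconds, on how long a probe script
# with internal retries keeps running when every attempt fails.
# Assumption: one sleep between consecutive attempts, none after the last.
worst_case_sleep() {
    count="${1:-30}"    # default COUNT described in this issue
    sleep_s="${2:-5}"   # default SLEEP described in this issue
    echo $(( (count - 1) * sleep_s ))
}

worst_case_sleep        # prints 145 (defaults: 30 attempts, 5 s apart)
worst_case_sleep 1 5    # prints 0 (COUNT=1: no internal retry, no sleeping)
```

Under these assumptions the default settings spend well over two minutes sleeping alone, far beyond the 1 sec default timeout, while COUNT=1 adds no sleep time at all.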

So, templates should use periodSeconds and failureThreshold to configure retries, and should set the "COUNT" arg to the scripts (first arg) to 1, disabling internal retry.
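The effect of passing COUNT=1 can be illustrated with a minimal sketch of an internally retrying probe. This is not the actual livenessProbe.sh or readinessProbe.sh; run_probe and check_once are hypothetical names, and check_once is a placeholder that always fails so the retry path is exercised.

```shell
#!/bin/sh
# Hypothetical sketch of a probe with internal retries; not the real
# os-eap-probes scripts. check_once stands in for the real health check.
check_once() {
    return 1   # always fail, so the retry path is exercised
}

run_probe() {
    count="${1:-30}"    # first arg: number of attempts (COUNT)
    sleep_s="${2:-5}"   # second arg: seconds between attempts (SLEEP)
    attempt=1
    while [ "$attempt" -le "$count" ]; do
        if check_once; then
            return 0
        fi
        # Pause only if another attempt remains.
        if [ "$attempt" -lt "$count" ]; then
            sleep "$sleep_s"
        fi
        attempt=$(( attempt + 1 ))
    done
    return 1
}

# With COUNT=1 the probe reports failure after a single attempt and never
# sleeps, leaving retries to periodSeconds and failureThreshold in k8s.
run_probe 1 5 || echo "probe failed after one attempt"
```

With internal retry disabled this way, each invocation returns quickly, and k8s decides how often to re-run the probe and how many consecutive failures to tolerate.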

At some point the default value of COUNT in the scripts could be changed to 1. Care needs to be taken with this though as that would change the behavior of images that don't include the updated k8s settings.

2) Also, livenessProbe.sh sleeps for 5 secs before beginning the probe.

# Sleep for 5 seconds to avoid launching readiness and liveness probes
# at the same time
sleep 5

If this is still a concern we need to find a different solution.

This will probably need subtasks or something, so different product teams can adjust their own templates.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
[2] https://github.com/kubernetes/kubernetes/issues/26895

causes

JBEAP-16858 [OCP 4.1] Pod is not restarted when MP Health returns DOWN or UNDETERMINED

Closed

is cloned by

CLOUD-3245 [7.2.x] Allow kubernetes to control probe retries; avoid probes taking longer than kubernetes timeout settings

is incorporated by

CLOUD-3268 EAP 7.2.3-opendk11 OpenShift Image release

Closed

Assignee: Ken Wills

Reporter: Brian Stansberry

Votes: 0

Watchers: 4

Created: 2018/07/24 5:14 PM

Updated: 2024/02/08 3:05 PM

Resolved: 2019/07/16 3:01 PM
