Enhancement
Resolution: Unresolved
Major
None
EAP64 1.8.9.GA
The xPaaS liveness probe does not fail when a probe request times out. The timeoutSeconds parameter has no effect on readiness and liveness probes that use Container Execution Checks, because dockershim ignores the timeout argument in its call to the Docker API.
We have a workaround that can negate this limitation.
Steps to Reproduce:
############################################
[...]
livenessProbe:
  exec:
    command:
      - /bin/bash
      - '-c'
      - /opt/eap/bin/livenessProbe.sh
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
############################################
The liveness probe is configured to restart the pod when the script fails to respond 3 times at an interval of 10 seconds, with the timeout set to 1 second. The pod is therefore expected to restart about 30 seconds after the probe starts failing or timing out.
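The expected delay follows directly from the probe settings. As a rough sketch (the variable names are illustrative and only mirror the YAML fields above; this ignores scheduling jitter and the initial delay):

```shell
# Values copied from the livenessProbe configuration above.
period_seconds=10
failure_threshold=3

# The container should be restarted after failure_threshold consecutive
# failed probes, one probe every period_seconds.
expected_restart_delay=$((period_seconds * failure_threshold))
echo "expected restart after about ${expected_restart_delay}s"
```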
1. Deploy eap64-basic-s2i application
2. oc rsh <eap-pod>
3. Stop the JBoss server --> sh-4.2$ kill -STOP <PID>
4. Stop the livenessProbe.sh process --> sh-4.2$ kill -STOP <PID>
5. Check atomic-openshift-node logs (log-level=4) on the node where pod is deployed.
Actual results:
sh-4.2$ time /opt/eap/bin/livenessProbe.sh
^C^CTraceback (most recent call last):
  File "/opt/eap/bin/probes/runner.py", line 113, in <module>
    time.sleep(args.sleep)
KeyboardInterrupt

real 0m49.311s
user 0m0.106s
sys 0m0.037s
The pod gets stuck in a zombie state and will not restart.
==========================================================================
This RFE requests that the workaround below be made the default in the image:
############################################
[...]
livenessProbe:
  exec:
    command:
      - /bin/bash
      - '-c'
      - timeout 60 /opt/eap/bin/livenessProbe.sh <==Changed
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
############################################
With the workaround applied, OpenShift correctly detects probe script executions that are frozen for more than 60 seconds, records them as failed, and restarts the container after 3 failures.
The timeout value can be reduced to minimize the delay.
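The workaround relies on the exit status of `timeout`: when the wrapped command overruns its limit, GNU coreutils `timeout` kills it and exits with status 124, which the probe sees as a failure. The sketch below illustrates that behaviour in isolation, assuming GNU coreutils `timeout` is available; `sleep 5` is only a stand-in for a hung livenessProbe.sh, and the 1-second limit is chosen to keep the demo short.

```shell
# Stand-in for a hung probe: `sleep 5` overruns the 1-second limit,
# so timeout terminates it and reports status 124.
hung_status=0
timeout 1 sleep 5 || hung_status=$?

# A command that finishes in time keeps its own exit status.
ok_status=0
timeout 1 true || ok_status=$?

echo "hung probe exit status: ${hung_status}"
echo "healthy probe exit status: ${ok_status}"
```

Any non-zero exit status counts as a failed exec probe, so a hung script wrapped this way contributes towards failureThreshold instead of blocking indefinitely.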