Type: Bug
Resolution: Unresolved
Priority: Major
It happens that HTTP status code 502 occurs while scaling down a Pod; the test asserts that this should not happen. I think what happens here is that the transition of the Ready state from true to false takes some time: when scaling down, the pod appears to be Ready=true until the last moment, when the Pod is deleted. So the Pod can receive a request even though it is no longer able to handle it.
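For illustration only, here is a minimal sketch, assuming a plain Java client, of the kind of check the test performs: poll the route continuously across the scale-down window and count any 502 responses. The URL, duration, and polling interval are hypothetical, not taken from the actual test.
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Hypothetical probe: keep requesting the route while the pod scales down
// and count the 502s that the test asserts should never appear.
public class ScaleDownProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://eap-route.example.com/")) // hypothetical route URL
                .timeout(Duration.ofSeconds(5))
                .build();

        int badGateway = 0;
        Instant deadline = Instant.now().plusSeconds(60); // cover the scale-down window
        while (Instant.now().isBefore(deadline)) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 502) {
                    badGateway++; // the router still targets a pod that is already gone
                }
            } catch (Exception e) {
                // connection failures are a separate concern; ignored in this sketch
            }
            Thread.sleep(100);
        }
        System.out.println("502 responses observed: " + badGateway);
    }
}
{code}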
JFD:
We can see in the logs (https://jenkins.eapqe.psi.redhat.com/job/eap-8.x-openshift-4-xp6-openjdk21/29/artifact/test-eap/log/test.log) a lot of 503, 502, and even 404 responses.
I suspect that the 503s are expected (that is the server not accepting new requests during its graceful shutdown). A 502 means that the router still knows the pod, but the pod is already down. I think this window can be eliminated.
I found that, as you suspected, the readiness probe state is what is used to remove a pod from the tables used for routing (and it is not automatically set to false when a pod is terminating).
We have a readiness probe with the default periodSeconds (10 seconds), so it takes some time for the cluster to notice that a pod can no longer receive requests.
What we could do:
1. Increase the pod's terminationGracePeriodSeconds (default 30 seconds) to something like 90 seconds.
2. When SIGTERM is received, immediately set the readiness probe to the false state, but keep the server accepting requests.
3. Wait (I would say something like 20 seconds) to make sure that the cluster has acknowledged that the pod is no longer ready (and will not route more requests to it).
4. Do the graceful shutdown.
It seems to me that this is the way to ensure that a pod is removed from the service routing tables before the graceful shutdown is initiated.
In terms of implementation, I would let Jeff Mesnil comment. I was thinking of a new flag to force the health subsystem to report ready == false. This is what we would call first thing in the SIGTERM hook; then we would wait 20 seconds, then initiate the server shutdown (with a 60-second timeout), as in the sketch below.
Doing so, we should even avoid the 503 errors (so no downtime): the server will be shut down only when the pod is no longer known by the service.
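To make the sequence concrete, here is a minimal sketch, assuming a standalone JVM process in which the JDK's built-in HTTP server stands in for the real readiness endpoint; the flag, the endpoint path, and the timings are illustrative, not the actual health-subsystem API.
{code:java}
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed SIGTERM sequencing: report not-ready first,
// wait for the routing tables to catch up, then shut down gracefully.
public class GracefulShutdownSketch {

    // The flag the SIGTERM hook flips; the readiness endpoint reports it.
    private static final AtomicBoolean READY = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Readiness endpoint: 200 while READY, 503 once shutdown has started.
        server.createContext("/health/ready", exchange -> {
            int status = READY.get() ? 200 : 503;
            byte[] body = (status == 200 ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // the server's dispatcher thread keeps the JVM alive

        // SIGTERM makes the JVM run its shutdown hooks.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // 1. Report not-ready immediately, but keep serving traffic.
            READY.set(false);
            try {
                // 2. Wait for the cluster to observe the failing probe and
                //    drop the pod from the routing tables.
                TimeUnit.SECONDS.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // 3. Stop accepting new requests and drain in-flight ones; the
            //    real server would run its graceful shutdown (60s timeout) here.
            server.stop(60);
        }));
    }
}
{code}
The 20-second wait is roughly two default probe periods, which should give the endpoint controller time to remove the pod from routing before any request is refused; the whole sequence (20 + 60 seconds) then has to fit inside the increased terminationGracePeriodSeconds of 90 seconds.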