Type: Bug
Resolution: Unresolved
Priority: Major
It happens that HTTP status code 502 occurs while scaling down a Pod; the test asserts that this should not happen. I think what happens here is that the transition of the Ready state from true to false takes some time: when scaling down, the pod appears to be Ready=true until the last moment, when the Pod is deleted. So the Pod can receive a request even though it is no longer able to handle it.
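For illustration only, here is a minimal sketch, assuming a plain Java client, of the kind of check the test performs: poll the route continuously across the scale-down window and count any 502 responses. The URL, duration, and polling interval are hypothetical, not taken from the actual test.
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Hypothetical probe: keep requesting the route while the pod scales down
// and count the 502s that the test asserts should never appear.
public class ScaleDownProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://eap-route.example.com/")) // hypothetical route URL
                .timeout(Duration.ofSeconds(5))
                .build();

        int badGateway = 0;
        Instant deadline = Instant.now().plusSeconds(60); // cover the scale-down window
        while (Instant.now().isBefore(deadline)) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 502) {
                    badGateway++; // the router still targets a pod that is already gone
                }
            } catch (Exception e) {
                // connection failures are a separate concern; ignored in this sketch
            }
            Thread.sleep(100);
        }
        System.out.println("502 responses observed: " + badGateway);
    }
}
{code}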
JFD:
We can see in the logs (https://jenkins.eapqe.psi.redhat.com/job/eap-8.x-openshift-4-xp6-openjdk21/29/artifact/test-eap/log/test.log) a lot of 503, 502, and even 404 responses.
I suspect that the 503s are expected (that is the server not accepting new requests during its graceful shutdown). A 502 means that the router still knows the pod, but the pod is already down. I think this window can be eliminated.
I found that, as you suspected, the readiness probe state is what is used to remove a pod from the tables used for routing (and it is not automatically set to false when a pod is terminating).
We have a readiness probe with the default periodSeconds (10 seconds), so it takes some time for the cluster to notice that a pod can no longer receive requests.
What we could do:
1. Increase the pod's terminationGracePeriodSeconds (default 30 seconds) to something like 90 seconds.
2. When SIGTERM is received, immediately set the readiness probe to the false state, but keep the server accepting requests.
3. Wait (I would say something like 20 seconds) to make sure that the cluster has acknowledged that the pod is no longer ready (and will not route more requests to it).
4. Do the graceful shutdown.
It seems to me that this is the way to ensure that a pod is removed from the service routing tables before the graceful shutdown is initiated.
In terms of implementation, I would let Jeff Mesnil comment. I was thinking of a new flag to force the health subsystem to report ready == false. This is what we would call first thing in the SIGTERM hook; then we would wait 20 seconds, then initiate the server shutdown (with a 60-second timeout), as in the sketch below.
Doing so, we should even avoid the 503 errors (so no downtime): the server will be shut down only when the pod is no longer known by the service.
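To make the sequence concrete, here is a minimal sketch, assuming a standalone JVM process in which the JDK's built-in HTTP server stands in for the real readiness endpoint; the flag, the endpoint path, and the timings are illustrative, not the actual health-subsystem API.
{code:java}
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed SIGTERM sequencing: report not-ready first,
// wait for the routing tables to catch up, then shut down gracefully.
public class GracefulShutdownSketch {

    // The flag the SIGTERM hook flips; the readiness endpoint reports it.
    private static final AtomicBoolean READY = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Readiness endpoint: 200 while READY, 503 once shutdown has started.
        server.createContext("/health/ready", exchange -> {
            int status = READY.get() ? 200 : 503;
            byte[] body = (status == 200 ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // the server's dispatcher thread keeps the JVM alive

        // SIGTERM makes the JVM run its shutdown hooks.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // 1. Report not-ready immediately, but keep serving traffic.
            READY.set(false);
            try {
                // 2. Wait for the cluster to observe the failing probe and
                //    drop the pod from the routing tables.
                TimeUnit.SECONDS.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // 3. Stop accepting new requests and drain in-flight ones; the
            //    real server would run its graceful shutdown (60s timeout) here.
            server.stop(60);
        }));
    }
}
{code}
The 20-second wait is roughly two default probe periods, which should give the endpoint controller time to remove the pod from routing before any request is refused; the whole sequence (20 + 60 seconds) then has to fit inside the increased terminationGracePeriodSeconds of 90 seconds.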