One of the problems often seen during rolling updates or other forms of pod shutdown is that the sidecar container handling TLS connections to Zookeeper shuts down very quickly and leaves the main container stranded without any connection to Zookeeper. The most obvious impact is on Kafka containers: when the sidecar shuts down before the main container, Kafka is left waiting for a Zookeeper connection until the grace period is over and the pod is killed.
To prevent this, we should keep the sidecar running until the main container has shut down and only then terminate it. This PR implements exactly that using a pre-stop hook in the TLS sidecar containers. The pre-stop hook waits until the main container is gone and only then exits, letting Kubernetes send the termination signal to the sidecar container. If this doesn't happen before the grace period expires, the pod will simply be killed. This should allow the main containers to shut down cleanly, which is especially important for Kafka pods.
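For illustration, here is a minimal sketch of how such a hook can be attached to a container definition using the fabric8 Kubernetes model builders. The `tlsSidecar` helper, its argument, and the image name are hypothetical; only the `Lifecycle`/`preStop`/`exec` structure is the real Kubernetes API:

```java
import io.fabric8.kubernetes.api.model.Container;
import io.fabric8.kubernetes.api.model.ContainerBuilder;
import io.fabric8.kubernetes.api.model.Lifecycle;
import io.fabric8.kubernetes.api.model.LifecycleBuilder;

public class TlsSidecarPreStopSketch {

    /**
     * Hypothetical helper: returns a TLS sidecar container whose pre-stop
     * hook runs the given shell command. Kubernetes sends SIGTERM to the
     * container only after the hook exits; if the hook is still running
     * when the termination grace period ends, the pod is killed anyway.
     */
    static Container tlsSidecar(String waitCommand) {
        Lifecycle lifecycle = new LifecycleBuilder()
                .withNewPreStop()
                    .withNewExec()
                        .withCommand("/bin/sh", "-c", waitCommand)
                    .endExec()
                .endPreStop()
                .build();

        return new ContainerBuilder()
                .withName("tls-sidecar")
                .withImage("tls-sidecar-image:latest") // placeholder image
                .withLifecycle(lifecycle)
                .build();
    }
}
```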
The pre-stop hook currently uses TCP connections to check the state of the main container (a sketch of both checks follows the list):

- For Kafka and the EO, it waits until there are no more connections through port 2181 on the sidecar (i.e. the main container is no longer connected to ZK).
- For ZK this is a bit more complicated because, for example, the leader has no outgoing connections. So there we instead wait for the main container to close its listeners.
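As a rough sketch of what the two wait loops could look like (the exact commands and port numbers are illustrative assumptions, not necessarily what this PR ships; `netstat` being available in the sidecar image is also an assumption):

```java
// Kafka / EO sidecar: wait until no established TCP connections pass
// through port 2181 on the sidecar any more, i.e. the main container
// has dropped its ZK connection.
String kafkaEoWait =
        "while netstat -ant | grep 2181 | grep -q ESTABLISHED; do sleep 1; done";

// ZK sidecar: the leader may have no outgoing connections, so instead
// wait until the server in the main container has closed its listening
// sockets (port illustrative).
String zkWait =
        "while netstat -lnt | grep -q 2181; do sleep 1; done";
```

These strings would then be passed to something like the hypothetical `tlsSidecar(...)` helper sketched above.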
These were the best setups I found. I tried to use the same approach as for the ZK sidecar for the other pods as well, but Kafka, for example, closes its listeners before its ZK connection, so the shutdown doesn't work as smoothly. This led me to the combination described above.
Related to ENTMQST-1411: tls-sidecar can terminate earlier than Kafka container itself (Closed)