When using cluster scaling and querying a URL in a pod on the side. All the while running some custom watches on endpoints and nodes.
When the nodes scale down, it seems a few seconds before an event marks the node as Not Ready and before the dns-default endpoint is removed from the endpoints list a DNS query can fail.
We wrote some simply watcher (see below for details) to log this and got the following events:
DNS lookup failure:
The node was not yet Not Ready and the endpoint was still in the list of endpoints at that time (ntrdy indicates a NotReadyEndpoint):
So we can observe that the node goes into 'Unknown' at 12:33:39, and the endpoint goes into Not Ready soon after.
Not sure if this is a logic problem of draining a node or an issue with the autoscaler at this point in time, but it fixes itself at the next lookup 5 seconds later.
Detailed breakdown of how this was reproduced:
1. A cluster with autoscaling enabled is required.
2. Deploy a daemonset that attempt to use DNS / HTTP in a loop, e.g. the following Daemonset was used to test this:
3. Run the following go program against the same cluster (this is what watches the node and endpoint events for the dns-default endpoints):
4. Create simulated pressure on the nodes to force a scaleup - e.g. use the following deployment:
5. Wait for the scale up to happen.
6. Delete the deployment that created the node pressure, so the scale down can happen (this can easily take 15 minutes).
7. Observe the events in the watcher program and the logs for the daemonset - this should show the same behavior as detailed above.
we found a few logged bugs that seemed related to this issue affecting clusters on 4.8 through 4.10. Those bugs are as follows:
Using the boave mentioned steps, we have been able to reliably reproduce the issue of DNS failures during autoscale-down in 4.10 clusters.