- Bug
- Resolution: Duplicate
- Normal
- 4.17.z, 4.16.z, 4.18.z, 4.19.0
- Quality / Stability / Reliability
- Important
- Rejected
- NI&D Sprint 273
- Issue Overview: If a client opens a web browser session to a backend application and the application stack is then scaled down and back up, the client will re-use its previous connection through the loadbalancer when it refreshes within 1m of those pods being deleted. As a result, the request stalls for 20s (or longer, depending on browser/caching rules) before the browser tries a new port and gets a successful return. Haproxy does not inform the client that backends are lost, so code changes can be highly disruptive for frontend-facing web sessions.
- Issue Detail:
- When a user connects to a webservice running in OpenShift via a loadbalancer, a frontend connection is opened using keepalive to the haproxy router pod running on the cluster, which then passes the request over a backend connection to the actual application to return the 200 response. Deleting ALL the backend pods at once (or just the original pod if there is only one backend for a given route) leads to a 20s (or longer) delay when hitting "refresh" in the browser, if performed within 1 minute of the pods being deleted.
- TRACKED BEHAVIOR: when we have 2 (or more) backend pods that are all removed at once (all backends for a route are removed), a new haproxy.config is populated with an entirely new set of pod IPs:
- If we then scale down and back up (or delete the backends) on the cluster and hit REFRESH in the browser within 1 minute of the backend pods being deleted (might be closer to 50s, but I can consistently replicate it at 30-45s), the following occurs:
- Keepalive packets maintain that connection with the haproxy pid every 10s or so. These packets are ACKed by haproxy, so the connection stays open from client --> LB --> router pod.
- When the user presses "refresh", the client re-tries the same port it was using with keepalive and attempts to repopulate the context of the webpage from the service application pod (a new GET is sent on the same session/connection/port).
- This active session is tied to the PID running in the router that WAS actively tied to the socket when the connection was made (and therefore still has the old haproxy.config). It will try (and fail) to connect to the same backend(s) for 15s, then probe any other backends available for 5s before declaring no hosts available and timing out the call at 20s.
- Router pod ships back a 503 at the 20s marker after failing to contact the known backends. Backends are marked as DOWN.
- Client browser will immediately ship a NEW SYN to the loadbalancer with the header information from the previous request, asking for a fresh GET to the web page.
- Because this second request originates from a new client port, it is handled by the new haproxy pid now tied to the socket (which has the latest haproxy.config data) and gets passed to the new backends successfully.
- A 200 is returned to the customer within milliseconds after this second call is shipped (approx 20.2s after the refresh was pressed).
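- The packet-level behavior above can be confirmed from the client side with a capture; a minimal sketch, where <lb-ip> is a placeholder for the loadbalancer address in front of the router:
# from the client machine: watch the idle keepalives get ACKed, then after
# a refresh see the ~20s stall end in the 503, immediately followed by a
# SYN from a new source port and a fast 200 on that new connection
sudo tcpdump -ni any "host <lb-ip> and tcp"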
- We can reduce this 20s delay by reducing the time spent waiting for health probes on the application per route, or by reducing the connectTimeout value on the ingresscontroller (see the example below).
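- For example, a hedged sketch of lowering connectTimeout on the default ingresscontroller (spec.tuningOptions.connectTimeout is the relevant field; the 2s value is purely illustrative, not a recommendation):
# shorten the per-attempt connect timeout so the failed retries against
# vanished backends each burn less than the default 5s
oc -n openshift-ingress-operator patch ingresscontroller/default \
  --type=merge -p '{"spec":{"tuningOptions":{"connectTimeout":"2s"}}}'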
- It is not possible to skip the 20s load time unless we open a new tab/window/private window and make a new call to the web page (which loads immediately - under 1s - once the new pods have come up).
- Curls to the route will always succeed in under a second unless we re-use the same port on a subsequent call, which results in the same 20s behavior.
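- One way to pin the client port with curl and reproduce the same-port behavior; a sketch using the rh-testing replicator namespace from below, noting that re-binding the same local port may be refused while the first socket sits in TIME_WAIT:
# standalone curls open a fresh source port each time and return in <1s:
curl -so /dev/null -w "%{time_total}\n" http://<route-host>/
# pin the client source port to force 4-tuple re-use across calls:
curl -so /dev/null -w "%{time_total}\n" --local-port 45000 http://<route-host>/
oc delete pods --all -n rh-testing
# re-using that same port within ~1m of the deletion should reproduce the
# ~20s stall described above:
curl -so /dev/null -w "%{time_total}\n" --local-port 45000 http://<route-host>/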
- Additional note:
- Client handling of this 20s timeout is VARIABLE. Certain browsers will retry that same port a SECOND TIME, or re-use OTHER pooled connections via the loadbalancer, before opening a new client connection, which can lead to a 40s delay or longer, depending on cache handling rules and default configs:
Firefox (my main browser) -> consistently 40s in response time.
Firefox on mobile -> consistently ~40s in response time.
Chrome on mobile -> varied across attempts: 1 minute, 20 seconds, 40 seconds, 40 seconds.
Safari -> consistently 20s.
Chrome -> observed a 40s load TWICE, the rest being 20s.
-
- It's important to understand that if I have 10 pods and delete only the pod the client is connected to, we will STILL experience this issue, because haproxy always reloads when there is a change, and the loadbalancer/client is NOT INFORMED that the backend is terminated/missing until a client request forces the router pod to evaluate health status on the application (at which point it can return the 503 after determining the backend is missing).
- We do not have a handler in place for terminating a client-side connection with a proactive RST. The L4 check that is performed for pods (which is not set for all route types) is insufficient at closing connections when the backend has been terminated, relying only upon the internal timeout that occurs within 1m when the pod fails its health probe from the router.
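- The stale state is visible on the router pod itself; an observation sketch, assuming ps and ss are present in the router image (<router-pod> and <deleted-pod-ip> are placeholders):
# list haproxy processes; after a reload an older pid typically lingers,
# draining connections that were accepted under the previous haproxy.config
oc -n openshift-ingress rsh <router-pod> ps ax
# show established sockets still pointing at a deleted backend pod IP
oc -n openshift-ingress rsh <router-pod> ss -tnp | grep <deleted-pod-ip>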
- Possible solution:
-
- It may be possible to mitigate this (but not sure about feasibility) with a handler that performs a diff check on the current haproxy.config against the previous haproxy.config (or the .map). If a backend is missing, any connections that were open to that backend IP could be forcibly closed with an RST to the client, informing it that the backend was lost and forcing a retry/reconnect (which would reduce downtime significantly during a change period) --> We may pursue that as an RFE, but I am filing this as a bug for engineering review on the problem as a whole first (see the sketch below).
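- A minimal sketch of that diff-and-kill idea, assuming snapshots of haproxy.config are kept, the hook runs inside the router pod, and the node kernel permits socket destruction via ss -K (CONFIG_INET_DIAG_DESTROY); none of this is an existing router feature:
#!/bin/bash
# hypothetical post-reload hook: diff the previous and current
# haproxy.config, then kill sockets to any backend that disappeared
OLD=/var/lib/haproxy/conf/haproxy.config.prev   # assumed snapshot location
NEW=/var/lib/haproxy/conf/haproxy.config        # config path on the router pod

# pull the <ip>:<port> field from each "server ..." line
extract_servers() { awk '/^[[:space:]]*server /{print $3}' "$1" | sort -u; }

# endpoints present before the reload but absent afterwards
removed=$(comm -23 <(extract_servers "$OLD") <(extract_servers "$NEW"))

for endpoint in $removed; do
  ip=${endpoint%%:*}
  # destroy established connections to the vanished backend so haproxy
  # fails fast instead of retrying the dead IP for the full 20s window
  ss -K dst "$ip" || echo "could not kill sockets for $ip" >&2
done
- Note that killing the router-to-backend sockets only covers half of the proposal; sending an RST to the client would additionally require haproxy to abort the matching frontend connections, which is part of why this is framed as an RFE candidate.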
- Version: OpenShift 4.17.16, but I have replicated this on all current versions of OpenShift and it is platform agnostic. (Customer is on Azure; I have replicated there and on quicklab, OpenStack, AWS.)
//WORKAROUNDS:
It may be possible to limit the impact for customer experience via the following:
- blue/green deployment solutions to perform sliding rollovers
- argoCD/gitops/devops pipelines that perform rolling restarts and slow migrations to changed builds
- Deployment strategies and pod disruption budgets that prevent all pods going down at once (see the sketch after this list)
- Scheduling downtime windows for rollovers
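- A hedged sketch of the last two deployment-side ideas, assuming the httpd-ex deployment from the replicator below and the default labels that oc new-app applies; names and values are illustrative:
# keep at least one httpd-ex pod serving during rollouts:
oc -n rh-testing patch deployment/httpd-ex --type=merge \
  -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'
# and block voluntary disruptions (drains/evictions) from taking the last pod:
cat <<'EOF' | oc -n rh-testing apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: httpd-ex-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      deployment: httpd-ex
EOF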
//REPLICATING THE BEHAVIOR INTERNALLY:
Easy replicator:
- Deploy a new namespace and a test httpd instance
oc new-project rh-testing
oc new-app httpd:latest~https://github.com/Scotchman0/httpd-ex
oc scale deployment/httpd-ex --replicas=2
oc expose svc/httpd-ex
oc get route
# open a web browser and load the webpage
# delete the httpd-ex-<string> pods and observe they are repopulated:
oc delete pods --all && oc get pods
# refresh the web browser (within 45s or so of that deletion request - I'm pretty
# sure within 1m we'll see it, but this is a safe margin) and observe a 20s delay
# on the load time before the page is re-populated. (You can track the load time
# using the browser's development panel to see how long we spent waiting for the
# request to be returned.)
- duplicates
OCPBUGS-43745 Route update does not work correctly in a multiple EAP clusters environment (Closed)
- is duplicated by
OCPBUGS-56142 OCP4: Haproxy does not handle loss of backends (Closed)
OCPBUGS-56143 OCP4: Haproxy does not handle loss of backends efficiently (Closed)