- Bug
- Resolution: Duplicate
- Normal
- 4.17.z, 4.16.z, 4.18.z, 4.19.0
- Quality / Stability / Reliability
- Important
- Rejected
- NI&D Sprint 273
- Issue Overview: If a client opens a web browser session to a backend application and the application stack is then scaled down and back up, the client will re-use its previous connection through the loadbalancer when it refreshes within 1m of those pods being deleted. As a result, the request stalls for 20s (or longer, depending on browser/caching rules) before the browser tries a new port and gets a successful return. Haproxy does not inform the client that backends are lost, so code changes can be highly disruptive for frontend-facing web sessions.
- Issue Detail:
- When a user connects to a webservice running in OpenShift via a loadbalancer, a frontend connection is opened using keepalive to the haproxy router pod running on the cluster, which then passes the request over a backend connection to the actual application to return the 200 response. Deleting ALL the backend pods at once (or just the original pod if there is only one backend for a given route) leads to a 20s (or longer) delay when hitting "refresh" in the browser, if performed within 1 minute of the pods being deleted.
- TRACKED BEHAVIOR: when we have 2 (or more) backend pods that are all removed at once (all backends for a route are removed), a new haproxy.config is populated with an entirely new set of pod IPs:
- If we then scale down and back up (or delete the backends) on the cluster and hit REFRESH in the browser within 1 minute of the backend pods being deleted (might be closer to 50s, but I can consistently replicate it at 30-45s), the following occurs:
- Keepalive packets maintain that connection with the haproxy pid every 10s or so. These packets are ACKed by haproxy, so the connection stays open from client --> LB --> router pod.
- When the user presses "refresh", the client re-tries the same port it was using with keepalive and attempts to repopulate the context of the webpage from the service application pod (a new GET is sent on the same session/connection/port).
- This active session is tied to the PID running in the router that WAS actively tied to the socket when the connection was made (and therefore still has the old haproxy.config). It will try (and fail) to connect to the same backend(s) for 15s, then probe any other backends available for 5s before declaring no hosts available and timing out the call at 20s.
- Router pod ships back a 503 at the 20s marker after failing to contact the known backends. Backends are marked as DOWN.
- Client browser will immediately ship a NEW SYN to the loadbalancer with the header information from the previous request, asking for a fresh GET to the web page.
- Because this second request originates from a new client port, it is handled by the new haproxy pid now tied to the socket (which has the latest haproxy.config data) and gets passed to the new backends successfully.
- A 200 is returned to the customer within milliseconds after this second call is shipped (approx 20.2s after the refresh was pressed).
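- The packet-level behavior above can be confirmed from the client side with a capture; a minimal sketch, where <lb-ip> is a placeholder for the loadbalancer address in front of the router:
# from the client machine: watch the idle keepalives get ACKed, then after
# a refresh see the ~20s stall end in the 503, immediately followed by a
# SYN from a new source port and a fast 200 on that new connection
sudo tcpdump -ni any "host <lb-ip> and tcp"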
- We can reduce this 20s delay by reducing the time spent waiting for health probes on the application per route, or by reducing the connectTimeout value on the ingresscontroller (see the example below).
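- For example, a hedged sketch of lowering connectTimeout on the default ingresscontroller (spec.tuningOptions.connectTimeout is the relevant field; the 2s value is purely illustrative, not a recommendation):
# shorten the per-attempt connect timeout so the failed retries against
# vanished backends each burn less than the default 5s
oc -n openshift-ingress-operator patch ingresscontroller/default \
  --type=merge -p '{"spec":{"tuningOptions":{"connectTimeout":"2s"}}}'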
- It is not possible to skip the 20s load time unless we open a new tab/window/private window and make a new call to the web page (which loads immediately - under 1s - once the new pods have come up).
- Curls to the route will always succeed in under a second unless we re-use the same port on a subsequent call, which results in the same 20s behavior.
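- One way to pin the client port with curl and reproduce the same-port behavior; a sketch using the rh-testing replicator namespace from below, noting that re-binding the same local port may be refused while the first socket sits in TIME_WAIT:
# standalone curls open a fresh source port each time and return in <1s:
curl -so /dev/null -w "%{time_total}\n" http://<route-host>/
# pin the client source port to force 4-tuple re-use across calls:
curl -so /dev/null -w "%{time_total}\n" --local-port 45000 http://<route-host>/
oc delete pods --all -n rh-testing
# re-using that same port within ~1m of the deletion should reproduce the
# ~20s stall described above:
curl -so /dev/null -w "%{time_total}\n" --local-port 45000 http://<route-host>/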
- Additional note:
- Client handling of this 20s timeout is VARIABLE. Certain browsers will retry that same port a SECOND TIME, or re-use OTHER pooled connections via the loadbalancer, before opening a new client connection, which can lead to a 40s delay or longer, depending on cache handling rules and default configs:
Firefox (my main browser) -> consistently 40s in response time.
Firefox on mobile -> consistently ~40s in response time.
Chrome on mobile -> varied across attempts: 1 minute, 20 seconds, 40 seconds, 40 seconds.
Safari -> consistently 20s.
Chrome -> observed a 40s load TWICE, the rest being 20s.
-
- It's important to understand that if I have 10 pods and delete only the pod the client is connected to, we will STILL experience this issue, because haproxy always reloads when there is a change, and the loadbalancer/client is NOT INFORMED that the backend is terminated/missing until a client request forces the router pod to evaluate health status on the application (at which point it can return the 503 after determining the backend is missing).
- We do not have a handler in place for terminating a client-side connection with a proactive RST. The L4 check that is performed for pods (which is not set for all route types) is insufficient at closing connections when the backend has been terminated, relying only upon the internal timeout that occurs within 1m when the pod fails its health probe from the router.
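- The stale state is visible on the router pod itself; an observation sketch, assuming ps and ss are present in the router image (<router-pod> and <deleted-pod-ip> are placeholders):
# list haproxy processes; after a reload an older pid typically lingers,
# draining connections that were accepted under the previous haproxy.config
oc -n openshift-ingress rsh <router-pod> ps ax
# show established sockets still pointing at a deleted backend pod IP
oc -n openshift-ingress rsh <router-pod> ss -tnp | grep <deleted-pod-ip>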
- Possible solution:
-
- It may be possible to mitigate this (but not sure about feasibility) with a handler that performs a diff check on the current haproxy.config against the previous haproxy.config (or the .map). If a backend is missing, any connections that were open to that backend IP could be forcibly closed with an RST to the client, informing it that the backend was lost and forcing a retry/reconnect (which would reduce downtime significantly during a change period) --> We may pursue that as an RFE, but I am filing this as a bug for engineering review on the problem as a whole first (see the sketch below).
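- A minimal sketch of that diff-and-kill idea, assuming snapshots of haproxy.config are kept, the hook runs inside the router pod, and the node kernel permits socket destruction via ss -K (CONFIG_INET_DIAG_DESTROY); none of this is an existing router feature:
#!/bin/bash
# hypothetical post-reload hook: diff the previous and current
# haproxy.config, then kill sockets to any backend that disappeared
OLD=/var/lib/haproxy/conf/haproxy.config.prev   # assumed snapshot location
NEW=/var/lib/haproxy/conf/haproxy.config        # config path on the router pod

# pull the <ip>:<port> field from each "server ..." line
extract_servers() { awk '/^[[:space:]]*server /{print $3}' "$1" | sort -u; }

# endpoints present before the reload but absent afterwards
removed=$(comm -23 <(extract_servers "$OLD") <(extract_servers "$NEW"))

for endpoint in $removed; do
  ip=${endpoint%%:*}
  # destroy established connections to the vanished backend so haproxy
  # fails fast instead of retrying the dead IP for the full 20s window
  ss -K dst "$ip" || echo "could not kill sockets for $ip" >&2
done
- Note that killing the router-to-backend sockets only covers half of the proposal; sending an RST to the client would additionally require haproxy to abort the matching frontend connections, which is part of why this is framed as an RFE candidate.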
- Version: OpenShift 4.17.16, but I have replicated this on all current versions of OpenShift and it is platform agnostic. (Customer is on Azure; I have replicated there and on quicklab, OpenStack, AWS.)
//WORKAROUNDS:
It may be possible to limit the impact for customer experience via the following:
- blue/green deployment solutions to perform sliding rollovers
- argoCD/gitops/devops pipelines that perform rolling restarts and slow migrations to changed builds
- Deployment strategies and pod disruption budgets that prevent all pods going down at once (see the sketch after this list)
- Scheduling downtime windows for rollovers
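- A hedged sketch of the last two deployment-side ideas, assuming the httpd-ex deployment from the replicator below and the default labels that oc new-app applies; names and values are illustrative:
# keep at least one httpd-ex pod serving during rollouts:
oc -n rh-testing patch deployment/httpd-ex --type=merge \
  -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'
# and block voluntary disruptions (drains/evictions) from taking the last pod:
cat <<'EOF' | oc -n rh-testing apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: httpd-ex-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      deployment: httpd-ex
EOF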
//REPLICATING THE BEHAVIOR INTERNALLY:
Easy replicator:
- Deploy a new namespace and a test httpd instance
oc new-project rh-testing
oc new-app httpd:latest~https://github.com/Scotchman0/httpd-ex
oc scale deployment/httpd-ex --replicas=2
oc expose svc/httpd-ex
oc get route
# open a web browser and load the webpage
# delete the httpd-ex-<string> pods and observe they are repopulated:
oc delete pods --all && oc get pods
# refresh the web browser (within 45s or so of that deletion request - I'm pretty
# sure within 1m we'll see it, but this is a safe margin) and observe a 20s delay
# on the load time before the page is re-populated. (You can track the load time
# using the browser's development panel to see how long we spent waiting for the
# request to be returned.)
- duplicates
OCPBUGS-43745 Route update does not work correctly in a multiple EAP clusters environment (Closed)
- is duplicated by
OCPBUGS-56142 OCP4: Haproxy does not handle loss of backends (Closed)
OCPBUGS-56143 OCP4: Haproxy does not handle loss of backends efficiently (Closed)