-
Bug
-
Resolution: Won't Do
-
Undefined
-
None
-
4.18, 4.19
-
Quality / Stability / Reliability
-
False
-
-
2
-
Important
-
None
-
None
-
None
-
Rejected
-
NI&D Sprint 268
-
1
Description of problem:
When the service of a route is updated, the router processes the change as two separate events: a route removal followed by a route addition (https://github.com/openshift/router/blob/7a688b0eab5a27fe13988a21022c774cdeb964b2/pkg/router/template/router.go#L1090-L1108). In a DCM-enabled router, the route removal is handled dynamically by disabling all servers associated with the route, whereas the route addition cannot be done dynamically and results in a router reload. If the idle-close-on-response option is enabled, this can lead to a 503 error when a client reuses a persistent connection. With this option, idle connections are not reset immediately on reload; closing each one is deferred until one more request has been handled on it (as per HAProxy's idle-close-on-response behavior). The old HAProxy process therefore remains active with all servers of the affected route in maintenance mode. As a result, the next request on the persistent connection to this route receives a 503 error, while all later requests correctly reach the new service endpoint(s).
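The sequence described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not router or HAProxy code: the `Process` class, `update_route_service`, `simulate`, and the endpoint names are all hypothetical, and the model assumes a single route and a single client connection.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Toy stand-in for one HAProxy process serving a single route."""
    name: str
    endpoints: dict          # endpoint name -> "ready" or "maint"
    old: bool = False        # True once a reload has replaced this process

    def handle_request(self):
        ready = [e for e, state in self.endpoints.items() if state == "ready"]
        status, endpoint = (200, ready[0]) if ready else (503, None)
        # Modeled idle-close-on-response behavior: an old process answers one
        # more request on an idle connection and only then closes it.
        return status, endpoint, self.old


def update_route_service(current, new_endpoints):
    # Event 1, route removal: handled dynamically by disabling every server.
    for endpoint in current.endpoints:
        current.endpoints[endpoint] = "maint"
    # Event 2, route addition: not dynamic, so a reload starts a new process.
    current.old = True
    return Process("new", {e: "ready" for e in new_endpoints})


def simulate():
    old = Process("old", {"old-endpoint": "ready"})
    owner = old  # the persistent connection belongs to the accepting process
    results = [old.handle_request()[:2]]
    new = update_route_service(old, ["new-endpoint"])
    for _ in range(3):
        status, endpoint, closed = owner.handle_request()
        results.append((status, endpoint))
        if closed:
            owner = new  # the client reconnects and lands on the new process
    return results
```

In this model `simulate()` yields one 200 from the old endpoint, one 503, and then 200 responses from the new endpoint, which matches the actual results reported below.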
Version-Release number of selected component (if applicable):
4.18+ (when TechPreview featureset is used)
How reproducible:
Always
Steps to Reproduce:
1. Make sure to use the router with Dynamic Configuration Manager enabled.
2. Make sure the idle-close-on-response option is set in the router's configuration.
3. Make sure to use an HTTP client which reuses connections.
4. Send an HTTP request to a route.
5. Change the route's "to" service to another one with a ready endpoint.
6. Send another HTTP request to the same route.
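Steps 3, 4 and 6 can be sketched with a client that keeps a single connection open across both requests. This is a minimal sketch using Python's standard `http.client`; the function names, the `wait_for_route_change` callback, and the host and port arguments are assumptions for illustration, and step 5 still has to be performed out of band while the callback blocks.

```python
import http.client


def send_on_persistent_connection(conn, path="/"):
    """Send one GET on an already open connection and drain the response."""
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()  # the body must be read before the connection is reused
    return resp.status, resp.getheader("Connection"), body


def reproduce(host, port, wait_for_route_change, path="/"):
    """Send a request, wait for the route change, then reuse the connection."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        before = send_on_persistent_connection(conn, path)
        wait_for_route_change()  # step 5 happens while this callback blocks
        after = send_on_persistent_connection(conn, path)
        return before, after
    finally:
        conn.close()
```

For an interactive run, something like `reproduce(route_host, 80, lambda: input("Change the route, then press Enter: "))` could be used, where `route_host` is the hostname of the route under test. A 503 as the status of the second tuple would correspond to the behavior described in this report.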
Actual results:
503 error for the first request sent after the route change. All subsequent requests succeed and are served by the new endpoint.
Expected results:
The first request sent after the route change should go to the old endpoint. All subsequent requests succeed and are served by the new endpoint.
Additional info:
Potential fix: fall back to a router reload for the route update event.
- blocks: NE-1874 Graduate Dynamic Config Manager to GA (New)
- is cloned by: OCPBUGS-49908 [DCM] Service idling fails due to non disabled health check on server slot with clusterIP address (Closed)
- is incorporated by: NE-1984 Review and address and bugs that were deferred from Tech Preview (To Do)