Type: Bug
Resolution: Done
Priority: Blocker
OpenShift version: 4.2.2
OpenShift Service Mesh version: 1.0.2 (Istio: 1.1.11)
We’ve been observing occasional downstream connection terminations when requests enter our cluster via our Istio Ingress gateway. These terminations only affect TLS requests, and appear to increase in frequency when the ingress gateway is under load.
Here is the server definition in question from our Gateway resource definition:
spec:
  selector:
    app: test-gw-ctrl-ingress
  servers:
  - hosts:
    - '*.int-priv.test.example.com'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      caCertificates: /etc/istio/ingressgateway-ca-certs/tls.crt
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
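For context, the tls paths above are the conventional file mounts for the ingress gateway's TLS secrets rather than anything custom. A minimal sketch of the kind of secret backing /etc/istio/ingressgateway-certs follows; the secret name, namespace, and key names are assumptions based on the stock istio-ingressgateway deployment, not taken from our cluster:
apiVersion: v1
kind: Secret
metadata:
  # Assumed convention: the default istio-ingressgateway deployment mounts a secret
  # named istio-ingressgateway-certs at /etc/istio/ingressgateway-certs.
  name: istio-ingressgateway-certs
  namespace: istio-system
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded wildcard server certificate>
  tls.key: <base64-encoded private key>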
A fairly standard VirtualService definition exists for the upstream service we’re testing against:
spec:
  gateways:
  - test-gw-ctrl/test-gw-ctrl
  hosts:
  - service-something-v1.int-ctrl.test.example.com
  http:
  - route:
    - destination:
        host: service-something-v1
        port:
          number: 8080
We have been able to rule out the upstream service as the cause of these terminations: exposing the same upstream service via an OpenShift route (behind HAProxy), or simply targeting a different upstream service, results in the expected behaviour. Both the Istio ingress gateway and, in the alternate scenario, the HAProxy ingress controller sit behind AWS ELBs.
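For completeness, the alternate exposure path used in that comparison was an OpenShift Route along the following lines; the Route name and TLS termination mode here are illustrative assumptions rather than the exact object from our cluster:
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  # Illustrative name; the host and backend service match the VirtualService above.
  name: service-something-v1
spec:
  host: service-something-v1.int-ctrl.test.example.com
  to:
    kind: Service
    name: service-something-v1
  port:
    targetPort: 8080
  tls:
    # Termination mode is an assumption for this sketch.
    termination: edge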
Within Jaeger, the error flag set on the affected traces is "DC", which is Envoy's response flag for a downstream connection termination. The same "DC" flag is visible in the Istio ingress gateway access logs for these requests.
From the client side, we simply get an end of stream / end of file response depending on which client is used:
2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Start : sample https://service-something-v1.int-ctrl.test.example.com/v1/somethings method GET followingRedirect false depth 0
2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Reusing the HttpClient: @1882043008 https://service-something-v1.int-ctrl.test.example.com
2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Found user token:Thread Group 1-1 as JMeter variable:_jmeter.U_T_, storing it in HttpContext
2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAddCookies: CookieSpec selected: default
2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAddCookies: Unsupported cookie policy: default
2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAuthCache: Auth cache not set in the context
2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager: Connection request: [route: {s}->https://service-something-v1.int-ctrl.test.example.com:443][state: Thread Group 1-1][total kept alive: 1; route allocated: 1 of 2; total allocated: 1 of 20]
2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager: Connection leased: [id: 1044][route: {s}->https://service-something-v1.int-ctrl.test.example.com:443][state: Thread Group 1-1][total kept alive: 0; route allocated: 1 of 2; total allocated: 1 of 20]
2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: set socket timeout to 0
2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: set socket timeout to 10000
2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Executing request GET /v1/somethings HTTP/1.1
2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Target auth state: UNCHALLENGED
2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Proxy auth state: UNCHALLENGED
2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> GET /v1/somethings HTTP/1.1
2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> Connection: keep-alive
2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> Host: service-something-v1.int-ctrl.test.example.com
2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_232)
2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "GET /v1/somethings HTTP/1.1[\r][\n]"
2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "Connection: keep-alive[\r][\n]"
2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "Host: service-something-v1.int-ctrl.test.example.com[\r][\n]"
2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_232)[\r][\n]"
2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "[\r][\n]"
2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Sent 169 bytes
2020-02-04 19:03:19,396 DEBUG o.a.h.wire: http-outgoing-1044 << "end of stream"
2020-02-04 19:03:19,396 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: Close connection
2020-02-04 19:03:19,396 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: Shutdown connection
2020-02-04 19:03:19,396 DEBUG o.a.h.i.e.MainClientExec: Connection discarded
This issue seems to be far more common when requests are issued from Apache HttpClient4-based clients (JMeter 5.0 r1840935 specifically), even though they send the same set of headers as other clients. Including or omitting Keep-Alive headers in the requests doesn't seem to make a difference.
We suspect it may have something to do with the following issues:
https://github.com/istio/istio/issues/12162
https://github.com/istio/istio/issues/17139
Are you aware of any similar incidents? Are there any steps we can take to resolve this?