MAISTRA-1166

Occasional downstream connection terminations when requests enter our cluster via our Istio ingress gateway


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version/s: maistra-1.1.8
    • Component/s: istio
    • Target Release: MAISTRA 1.1.8

      OpenShift version: 4.2.2
      OpenShift Service Mesh version: 1.0.2 (Istio: 1.1.11)

      We’ve been observing occasional downstream connection terminations when requests enter our cluster via our Istio Ingress gateway. These terminations only affect TLS requests, and appear to increase in frequency when the ingress gateway is under load.

      Here is the server definition in question from our Gateway resource:

      spec:
        selector:
          app: test-gw-ctrl-ingress
        servers:
        - hosts:
          - '*.int-priv.test.example.com'
          port:
            name: https
            number: 443
            protocol: HTTPS
          tls:
            caCertificates: /etc/istio/ingressgateway-ca-certs/tls.crt
            mode: SIMPLE
            privateKey: /etc/istio/ingressgateway-certs/tls.key
            serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
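
      (For reference, the listener and TLS configuration the gateway is actually serving can be compared against this Gateway definition via the Envoy admin endpoint on the ingress gateway pod; the pod name below is a placeholder, and this assumes curl is available in the proxy image:)

      oc -n istio-system exec <istio-ingressgateway-pod> -- curl -s localhost:15000/config_dump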

      A fairly standard VirtualService definition exists for the upstream service we’re testing against:

      spec:
        gateways:
        - test-gw-ctrl/test-gw-ctrl
        hosts:
        - service-something-v1.int-ctrl.test.example.com
        http:
        - route:
          - destination:
              host: service-something-v1
              port:
                number: 8080

      We have been able to rule out the upstream service as the cause of these terminations: exposing the upstream service via an OpenShift route (behind HAProxy), or simply targeting a different upstream service, results in the expected behaviour. Both the Istio ingress gateway and the HAProxy ingress controller in the alternate scenario sit behind AWS ELBs.

      Within Jaeger, the error flag set on the affected traces is “DC” (Envoy’s response flag for a downstream connection termination). “DC” is also visible in the Istio ingress gateway access logs for these requests.
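
      For illustration, an access log entry for one of these requests would look roughly like the following (hypothetical values, default Envoy access log format assumed, with the response flags field carrying DC):

      [2020-02-04T19:03:19.366Z] "GET /v1/somethings HTTP/1.1" 0 DC 0 0 30 - "-" "Apache-HttpClient/4.5.10 (Java/1.8.0_232)" "<request-id>" "service-something-v1.int-ctrl.test.example.com" "-"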

      From the client side, we simply get an end of stream / end of file response depending on which client is used:
      2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Start : sample https://service-something-v1.int-ctrl.test.example.com/v1/somethings method GET followingRedirect false depth 0
      2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Reusing the HttpClient: @1882043008 https://service-something-v1.int-ctrl.test.example.com
      2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Found user token:Thread Group 1-1 as JMeter variable:_jmeter.U_T_, storing it in HttpContext
      2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAddCookies: CookieSpec selected: default
      2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAddCookies: Unsupported cookie policy: default
      2020-02-04 19:03:19,366 DEBUG o.a.h.c.p.RequestAuthCache: Auth cache not set in the context
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager: Connection request: [route: {s}->https://service-something-v1.int-ctrl.test.example.com:443][state: Thread Group 1-1][total kept alive: 1; route allocated: 1 of 2; total allocated: 1 of 20]
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager: Connection leased: [id: 1044][route: {s}->https://service-something-v1.int-ctrl.test.example.com:443][state: Thread Group 1-1][total kept alive: 0; route allocated: 1 of 2; total allocated: 1 of 20]
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: set socket timeout to 0
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: set socket timeout to 10000
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Executing request GET /v1/somethings HTTP/1.1
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Target auth state: UNCHALLENGED
      2020-02-04 19:03:19,366 DEBUG o.a.h.i.e.MainClientExec: Proxy auth state: UNCHALLENGED
      2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> GET /v1/somethings HTTP/1.1
      2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> Connection: keep-alive
      2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> Host: service-something-v1.int-ctrl.test.example.com
      2020-02-04 19:03:19,366 DEBUG o.a.h.headers: http-outgoing-1044 >> User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_232)
      2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "GET /v1/somethings HTTP/1.1[\r][\n]"
      2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "Connection: keep-alive[\r][\n]"
      2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "Host: service-something-v1.int-ctrl.test.example.com[\r][\n]"
      2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "User-Agent: Apache-HttpClient/4.5.10 (Java/1.8.0_232)[\r][\n]"
      2020-02-04 19:03:19,366 DEBUG o.a.h.wire: http-outgoing-1044 >> "[\r][\n]"
      2020-02-04 19:03:19,366 DEBUG o.a.j.p.h.s.HTTPHC4Impl: Sent 169 bytes
      2020-02-04 19:03:19,396 DEBUG o.a.h.wire: http-outgoing-1044 << "end of stream"
      2020-02-04 19:03:19,396 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: Close connection
      2020-02-04 19:03:19,396 DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection: http-outgoing-1044: Shutdown connection
      2020-02-04 19:03:19,396 DEBUG o.a.h.i.e.MainClientExec: Connection discarded
      This issue seems to be far more common when requests are issued from Apache HttpClient4-based clients (JMeter 5.0 r1840935 specifically), despite them sending the same set of headers as other clients. Including or omitting Keep-Alive headers in the requests does not seem to make a difference.
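
      For completeness, the client behaviour above can be approximated outside JMeter with a small Apache HttpClient 4.5 program that re-uses pooled keep-alive connections for repeated GETs against the gateway. This is only a rough sketch (pool sizes mirror the log above; the iteration count and timeout are assumptions), not our actual test plan:

      import org.apache.http.client.config.RequestConfig;
      import org.apache.http.client.methods.CloseableHttpResponse;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.apache.http.impl.client.HttpClients;
      import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
      import org.apache.http.util.EntityUtils;

      public class GatewayKeepAliveRepro {
          public static void main(String[] args) throws Exception {
              // Connection pool sized to match the JMeter log above
              // ("route allocated: x of 2; total allocated: x of 20").
              PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
              cm.setMaxTotal(20);
              cm.setDefaultMaxPerRoute(2);

              RequestConfig config = RequestConfig.custom()
                      .setSocketTimeout(10000)   // matches "set socket timeout to 10000"
                      .build();

              try (CloseableHttpClient client = HttpClients.custom()
                      .setConnectionManager(cm)
                      .setDefaultRequestConfig(config)
                      .build()) {

                  String url = "https://service-something-v1.int-ctrl.test.example.com/v1/somethings";
                  for (int i = 0; i < 1000; i++) {
                      HttpGet get = new HttpGet(url);
                      get.setHeader("Connection", "keep-alive");
                      try (CloseableHttpResponse response = client.execute(get)) {
                          // Consume the body so the connection is returned to the
                          // pool and re-used (keep-alive) on the next iteration.
                          EntityUtils.consume(response.getEntity());
                      } catch (Exception e) {
                          // The terminations surface here as EOF / end-of-stream errors.
                          System.err.println("request " + i + " failed: " + e);
                      }
                  }
              }
          }
      }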

      We suspect it may have something to do with the following issues:

      https://github.com/istio/istio/issues/12162
      https://github.com/istio/istio/issues/17139

      Are you aware of any incidents similar to this? Are there any steps we can take to resolve it?

              kconner@redhat.com Kevin Conner (Inactive)
              rhn-support-misalunk Miheer Salunke
              Votes: 0
              Watchers: 5
