Knative Serving / SRVKS-1099

Draining behaviour with Istio in an mTLS mesh (mode=STRICT) does not work


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: 1.31.0
    • Affects Versions: 1.27.0, 1.28.0, 1.29.0

      Context

      There is an issue with Knative's draining behaviour when using Service Mesh with full mTLS enforcement (mode=STRICT).

      This is why tests like `TestDestroyPodInflight` are very flaky here: https://github.com/openshift-knative/serverless-operator/pull/2138.

      • With Kourier, everything works as expected
      • With Istio as an ingress controller only (no mesh), everything works as expected

       

      Reproduce

      See https://gist.github.com/ReToCode/e0b2a8d0d91809bf489caa6e26d287ca#file-reproduce-md 

       

      Problem 1.1: Envoy terminates too early

      Knative drains connections as follows:

      • QP catches SIGTERM and sleeps 30 seconds before exiting
      • The user-container gets a PreStop hook that K8s calls before sending the SIGTERM. The hook is mapped to port 8022 on QP; QP again waits for the 30 seconds and then releases the hook, so K8s terminates the user-container (see the sketch after this list)
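
      For reference, the revision deployment that Knative Serving generates wires this up roughly as follows (abbreviated sketch; only the drain-related fields are shown):

      containers:
      - name: user-container
        lifecycle:
          preStop:
            httpGet:
              # 8022 is the queue-proxy admin port; the handler blocks until QP has finished draining
              path: /wait-for-drain
              port: 8022
      - name: queue-proxy
        # ... listens on the admin port 8022 and serves the drain handler used above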

      When Istio is used as a mesh, it injects istio-proxy into every Knative Serving pod, but istio-proxy does not know about this behaviour. Its drain period is 5s, so istio-proxy terminates before QP and the user-container.

      2023-07-13T09:10:28.706951023Z istio-proxy 2023-07-13T09:10:28.706875Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
      2023-07-13T09:10:30.707229025Z istio-proxy 2023-07-13T09:10:30.707137Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
      2023-07-13T09:10:32.706613579Z istio-proxy 2023-07-13T09:10:32.706571Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
      2023-07-13T09:10:32.768848101Z istio-proxy 2023-07-13T09:10:32.768775Z    warn    Aborted proxy instance
      2023-07-13T09:10:32.768848101Z istio-proxy 2023-07-13T09:10:32.768811Z    warn    Aborting proxy
      2023-07-13T09:10:32.772912654Z queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/?timeout\u003D35000", "requestSize": "0", "status": 500, "responseSize": "0", "userAgent": "curl/7.88.1", "remoteIp": "127.0.0.6:36655", "serverIp": "10.128.1.19", "referer": "", "latency": "22.271729538s", "protocol": "HTTP/1.1"}, "traceId": "[61686e74c8b1adaf69fc2b3cb32ea02b]"}
      2023-07-13T09:10:33.742662Z    Stream closed EOF for serving-tests/timeout-00001-deployment-6f477568bb-rfwq5 (istio-proxy)
      2023-07-13T09:10:45.502117924Z user-container After sleep
      2023-07-13T09:10:45.502117924Z user-container After response write
      2023-07-13T09:10:45.761490737Z user-container Server stopped 

      The istio-proxy is already gone at "Stream closed EOF for serving-tests/timeout-00001-deployment-6f477568bb-rfwq5 (istio-proxy)", while the user-container only responds after that, so the in-flight request fails.

       

      Problem 1.2: TERMINATION_DRAIN_DURATION_SECONDS does not work any longer

      The env variable `TERMINATION_DRAIN_DURATION_SECONDS` no longer has an effect; the default drain behaviour is applied:

      // TERMINATION_DRAIN_DURATION_SECONDS = 20
      "PROXY_CONFIG": {
      ...
        "terminationDrainDuration": "5s",
      ...
      }
      
      // terminationDrainDuration set in meshConfig to 35s
      "PROXY_CONFIG": {
      ...
        "terminationDrainDuration": "35s", 
      ...
      }

      It seems that upstream Istio dropped that feature; downstream, the meshConfig override is only available in the techPreview section (see Solution part 2).

       

       

      Problem 2: with strict mTLS enabled, K8S cannot call the PreStopHook

      We see a FailedPreStopHook event and the user-container gets the SIGTERM immediately, so the in-flight request is terminated too early.
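
      The hook is invoked by the kubelet, which is not part of the mesh and calls port 8022 over plain HTTP, so strict mTLS rejects the call. For illustration, a mesh-wide policy along the following lines is what enforces STRICT mode (name and namespace are the common defaults and may differ per mesh):

      apiVersion: security.istio.io/v1beta1
      kind: PeerAuthentication
      metadata:
        name: default
        namespace: istio-system # mesh control-plane namespace; illustrative
      spec:
        mtls:
          mode: STRICT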

       

      Full logs: https://gist.github.com/ReToCode/e0b2a8d0d91809bf489caa6e26d287ca

       

      Problem 3: IOR creates an additional OpenShift Route for a Gateway used with a BYO certificate + DomainMapping

      Maistra has a (deprecated, but still active) component that automatically creates OpenShift Routes for an Istio Gateway that has a host. Unfortunately, there is no opt-out via label/annotation on this.

      Source-Code: https://github.com/maistra/ior/ 

      The result is that we end up with two conflicting OpenShift Routes for a BYO certificate on a DomainMapping: one points to the knative-local-gateway (wrong!) and one points to the external gateway (correct), as sketched below. If the IOR Route is created first, requests get routed to the wrong gateway. That still works as long as we do not enforce mTLS, but if we enforce it and land on the wrong gateway without mTLS, requests are dropped and we get EOF.
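
      Schematically, the two Routes differ only in the service they target (hostname and service names below are illustrative; the real YAMLs are in the gist linked below):

      # Route created by IOR for the Gateway - targets the local gateway (wrong):
      spec:
        host: example.com
        to:
          kind: Service
          name: knative-local-gateway
      ---
      # Route intended for the DomainMapping - targets the external gateway (correct):
      spec:
        host: example.com
        to:
          kind: Service
          name: istio-ingressgateway # "external" gateway service; illustrative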

       

      Full YAMLs of this: https://gist.github.com/ReToCode/e33c93d7cf7d4a02a34afc151c053508 

       

      Solution part 1: use portLevelMtls for port 8022 or disable the istio-proxy on port 8022

      Unfortunately, we either need a PeerAuthentication like this for every Knative Service:

      ---
      apiVersion: "security.istio.io/v1beta1"
      kind: "PeerAuthentication"
      metadata:
        name: "allow-wait-for-drain-non-tls"
        namespace: "serving-tests"
      spec:
        selector:
          matchLabels:
            serving.knative.dev/service: timeout
        portLevelMtls:
          "8022":
            mode: PERMISSIVE
      --- 

      OR

      we disable istio-proxy interception on that port in the ServiceMeshControlPlane proxy settings:

      spec:
        proxy:
          networking:
            trafficControl:
              inbound:
                excludedPorts:
                - 8022

      Additionally, if an AuthorizationPolicy is in place that requires all traffic to be authenticated, we also need to allow unauthenticated traffic to port 8022:

      ---
      apiVersion: security.istio.io/v1beta1
      kind: AuthorizationPolicy
      metadata:
        name: allow-traffic-to-drain-port
        namespace: serving-tests
      spec:
        action: ALLOW
        rules:
        - to:
            - operation:
                ports: [ "8022" ]
      ---
       

       

      Solution part 2: increase envoy drain timeout via techPreview meshConfig feature

      Unfortunately, OpenShift Service Mesh does not (yet) officially support setting this. There is a techPreview field which does work:

      apiVersion: maistra.io/v2
      kind: ServiceMeshControlPlane
      metadata:
        name: basic
        namespace: istio-system
      spec:
        techPreview:
          meshConfig:
            defaultConfig:
              terminationDrainDuration: 35s 

      The terminationDrainDuration has to be larger than Knative's drain timeout (always 30s).

       

      Solution part 3: Disable IOR until it is fully removed from OSSM or annotate our gateways

      We can either disable IOR completely in the ServiceMeshControlPlane spec:

      spec:
        gateways:
          openshiftRoute:
            enabled: false 

      or annotate our gateways with:

      metadata:
        annotations:
          maistra.io/manageRoute: "false"
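
      For context, on a complete Gateway the annotation sits under metadata; everything apart from the annotation in the sketch below (name, namespace, host, credentialName) is illustrative:

      apiVersion: networking.istio.io/v1beta1
      kind: Gateway
      metadata:
        name: example-domainmapping-gateway # illustrative
        namespace: knative-serving          # illustrative
        annotations:
          maistra.io/manageRoute: "false"   # tells IOR not to create a Route for this Gateway
      spec:
        selector:
          istio: ingressgateway
        servers:
        - port:
            number: 443
            name: https
            protocol: HTTPS
          tls:
            mode: SIMPLE
            credentialName: example-byo-cert # illustrative BYO certificate secret
          hosts:
          - example.com                      # illustrative DomainMapping host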

       

      Verification

       

              Assignee: Reto Lehmann (rh-ee-rlehmann)
              Reporter: Reto Lehmann (rh-ee-rlehmann)