Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 1.31.0
Affects Version/s: 1.27.0, 1.28.0, 1.29.0
Component/s: None
Labels: None
Epic Link: SRVCOM-2606
Blocked: False
Blocked Reason: None
Ready: False

Context

There is an issue with Knative's draining behaviour when using Service Mesh with full mTLS enforcement (mode=STRICT).

This is why tests like `TestDestroyPodInflight` are super flaky here: https://github.com/openshift-knative/serverless-operator/pull/2138.

With Kourier, everything works as expected.
With Istio as an ingress controller only (no mesh), everything works as expected.

Reproduce

See https://gist.github.com/ReToCode/e0b2a8d0d91809bf489caa6e26d287ca#file-reproduce-md

Problem 1.1: Envoy terminates too early

Knative drains connections like this:

QP catches SIGTERM and sleeps for 30 seconds before exiting.
The user-container gets a PreStopHook that K8S calls before sending SIGTERM. The hook is mapped to port 8022 on QP. QP waits for the same 30 seconds and then releases the hook, so that K8S terminates the user-container.

When Istio is used as a mesh, it injects istio-proxy into every Knative Serving pod, and istio-proxy does not know about this behaviour. Its drain period is 5s, so istio-proxy terminates before QP and the user-container.
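The ordering problem can be illustrated with a small Python sketch. It is a simplified model, not Knative or Istio code: it assumes an in-flight request can only be answered while every component on the request path is still running, and it takes the 30s and 5s values from the description above. The names `DrainConfig` and `request_survives` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainConfig:
    """Drain periods in seconds, counted from the moment the pod is told to stop."""
    queue_proxy_drain_s: float = 30.0  # QP sleep after SIGTERM (value from this ticket)
    envoy_drain_s: float = 5.0         # istio-proxy drain period (value from this ticket)


def request_survives(remaining_s: float, cfg: DrainConfig = DrainConfig()) -> bool:
    """Return True if a request that needs `remaining_s` more seconds can still be answered.

    Simplifying assumption: the response has to pass through both queue-proxy
    and istio-proxy, so the usable window is the shorter of the two drain periods.
    """
    usable_window_s = min(cfg.queue_proxy_drain_s, cfg.envoy_drain_s)
    return remaining_s <= usable_window_s


if __name__ == "__main__":
    # A request that still needs 12s is lost with a 5s sidecar drain period...
    print(request_survives(12.0))
    # ...but fits once the sidecar drains for longer than queue-proxy does.
    print(request_survives(12.0, DrainConfig(envoy_drain_s=35.0)))
```

In this model, only requests that finish within the shortest drain period are safe, which is why the sidecar's 5s period dominates.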

2023-07-13T09:10:28.706951023Z istio-proxy 2023-07-13T09:10:28.706875Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
2023-07-13T09:10:30.707229025Z istio-proxy 2023-07-13T09:10:30.707137Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
2023-07-13T09:10:32.706613579Z istio-proxy 2023-07-13T09:10:32.706571Z    warn    Envoy proxy is NOT ready: server is not live, current state is: DRAINING
2023-07-13T09:10:32.768848101Z istio-proxy 2023-07-13T09:10:32.768775Z    warn    Aborted proxy instance
2023-07-13T09:10:32.768848101Z istio-proxy 2023-07-13T09:10:32.768811Z    warn    Aborting proxy
2023-07-13T09:10:32.772912654Z queue-proxy {"httpRequest": {"requestMethod": "GET", "requestUrl": "/?timeout\u003D35000", "requestSize": "0", "status": 500, "responseSize": "0", "userAgent": "curl/7.88.1", "remoteIp": "127.0.0.6:36655", "serverIp": "10.128.1.19", "referer": "", "latency": "22.271729538s", "protocol": "HTTP/1.1"}, "traceId": "[61686e74c8b1adaf69fc2b3cb32ea02b]"}
2023-07-13T09:10:33.742662Z    Stream closed EOF for serving-tests/timeout-00001-deployment-6f477568bb-rfwq5 (istio-proxy)
2023-07-13T09:10:45.502117924Z user-container After sleep
2023-07-13T09:10:45.502117924Z user-container After response write
2023-07-13T09:10:45.761490737Z user-container Server stopped

The istio-proxy is gone at "Stream closed EOF for serving-tests/timeout-00001-deployment-6f477568bb-rfwq5 (istio-proxy)" and the user-container responds after that.
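To quantify the gap, the two relevant timestamps from the log excerpt can be compared. The Python sketch below is only an illustration; `parse_ts` is a hypothetical helper that truncates the nanosecond fractions used in the logs to microseconds, because `datetime.fromisoformat` does not accept nine fractional digits.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating fractions to microseconds."""
    body = ts.rstrip("Z")
    if "." in body:
        head, frac = body.split(".", 1)
        body = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(body).replace(tzinfo=timezone.utc)


# Timestamps copied from the log excerpt above.
proxy_gone = parse_ts("2023-07-13T09:10:33.742662Z")       # istio-proxy stream closed
app_responded = parse_ts("2023-07-13T09:10:45.502117924Z")  # user-container "After sleep"

gap_s = (app_responded - proxy_gone).total_seconds()

if __name__ == "__main__":
    print(f"user-container responded {gap_s:.3f}s after istio-proxy was gone")
```

The user-container writes its response roughly 11.8 seconds after the sidecar has already exited, so the response has no path back to the client.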

Problem 1.2: TERMINATION_DRAIN_DURATION_SECONDS does not work any longer

The env variable `TERMINATION_DRAIN_DURATION_SECONDS` no longer has an effect; the default drain behaviour is applied:

// TERMINATION_DRAIN_DURATION_SECONDS = 20
"PROXY_CONFIG": {
...
  "terminationDrainDuration": "5s",
...
}

// terminationDrainDuration set in meshConfig to 35s
"PROXY_CONFIG": {
...
  "terminationDrainDuration": "35s", 
...
}

It seems that upstream Istio dropped that feature, and downstream the meshConfig is only available in the techPreview section.

Problem 2: with strict mTLS enabled, K8S cannot call the PreStopHook

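We see FailedPreStopHook as an event, and the user-container gets the SIGTERM immediately. The kubelet calls the hook over plain HTTP, which istio-proxy rejects when mTLS is STRICT. As a result, the existing request is terminated too early.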

Full logs: https://gist.github.com/ReToCode/e0b2a8d0d91809bf489caa6e26d287ca

Problem 3: IOR creates an additional OpenShift Route for a Gateway in BYOCertificate + DomainMapping

Maistra has a (deprecated, but still active) component that automatically creates OpenShift Routes for an Istio Gateway that has a host. Unfortunately, there is no opt-out for this via label or annotation.

Source-Code: https://github.com/maistra/ior/

As a result, we have two conflicting OpenShift Routes for a BYOCertificate on a DomainMapping. One points to the knative-local-gateway (wrong!) and one points to the external gateway (correct). If the IOR Route is created first, the requests get routed to the wrong gateway. That works fine as long as we do not enforce mTLS. If we enforce it and land on the wrong gateway without mTLS, requests are dropped and we get EOF.

Full YAMLs of this: https://gist.github.com/ReToCode/e33c93d7cf7d4a02a34afc151c053508

Solution part 1: use portLevelMtls for port 8022 or disable the istio-proxy on port 8022

Unfortunately, we either need to have this for every Knative Service:

---
apiVersion: "security.istio.io/v1beta1"
kind: "PeerAuthentication"
metadata:
  name: "allow-wait-for-drain-non-tls"
  namespace: "serving-tests"
spec:
  selector:
    matchLabels:
      serving.knative.dev/service: timeout
  portLevelMtls:
    "8022":
      mode: PERMISSIVE
---

OR

we disable istio-proxy on that port:

proxy:
  networking:
    trafficControl:
      inbound:
        excludedPorts:
        - 8022

Additionally, if an AuthorizationPolicy is in place that requires all traffic to be authenticated, we also need to allow unauthenticated traffic to port 8022:

---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-traffic-to-drain-port
  namespace: serving-tests
spec:
  action: ALLOW
  rules:
  - to:
      - operation:
          ports: [ "8022" ]
---

Solution part 2: increase envoy drain timeout via techPreview meshConfig feature

Unfortunately, SM does not (yet) support that. There is a techPreview flag which does work:

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: basic
  namespace: istio-system
spec:
  techPreview:
    meshConfig:
      defaultConfig:
        terminationDrainDuration: 35s

The terminationDrainDuration must be longer than the drain period in Knative (always 30s).
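A minimal Python sketch of that rule follows, for example as a pre-flight check on a configured value. The helper names are hypothetical, and the parser deliberately accepts only simple single-unit values such as `35s` or `1m`, not the full duration syntax Istio may allow.

```python
import re

KNATIVE_DRAIN_SECONDS = 30  # fixed drain period in Knative, per this ticket

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def duration_seconds(value: str) -> int:
    """Convert a simple duration string such as '35s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def drain_is_long_enough(termination_drain_duration: str) -> bool:
    """True if the sidecar drains strictly longer than Knative does."""
    return duration_seconds(termination_drain_duration) > KNATIVE_DRAIN_SECONDS


if __name__ == "__main__":
    for candidate in ("5s", "30s", "35s"):
        print(candidate, drain_is_long_enough(candidate))
```

The comparison is strict on purpose: a value equal to 30s would let istio-proxy exit at the same moment queue-proxy does.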

Solution part 3: Disable IOR until it is fully removed from OSSM or annotate our gateways

We can either disable IOR completely:

spec:
  gateways:
    openshiftRoute:
      enabled: false

or annotate our gateways with:

metadata:
  annotations:
    maistra.io/manageRoute: "false"

Verification

Approach 1, smcp.mtls: false, create our own PeerAuthentication: https://github.com/openshift-knative/serverless-operator/pull/2138
Approach 2, smcp.mtls: true: https://github.com/openshift-knative/serverless-operator/pull/2171
Just enable mTLS without AuthorizationPolicies: https://github.com/openshift-knative/serverless-operator/pull/2173
Final setup: https://github.com/openshift-knative/serverless-operator/pull/2228
