  1. Maistra
  2. MAISTRA-833

istio-pilot pod has to be restarted periodically to keep Knative services functional

    Details

    • Type: Bug
    • Status: Released
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: maistra-1.0.0
    • Fix Version/s: maistra-1.0.2
    • Component/s: pilot
    • Labels:
      None
    • Sprint:
      MAISTRA 1.0.2

      Description

      This is a tracking bug for https://jira.coreos.com/browse/SRVKS-213. The serverless team considers this a blocker and is trying to reproduce the issue in a cluster that can be debugged. Nothing about it appears to be Knative-specific at this point, so it is likely to affect other Service Mesh customers as well. The following description is copied from that issue:

      A Knative Service starts returning the 503 status code after the cluster has been running for some time. The time until this happens is random.

      The Knative service and its route show "Ready", but sending an HTTP request to the route returns 503. Knative Serving pods do not show any errors.
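Because the time until the first 503 is random, it can help to measure it while trying to reproduce the problem. The following is a minimal sketch, not part of the original report: it polls a route and returns the elapsed seconds when the first 503 is seen. The function names, the polling interval and the injectable `get_status`, `clock` and `sleep` parameters are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return err.code


def seconds_until_503(url, interval=30.0, max_probes=None,
                      get_status=http_status,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll url until it answers 503 and return the elapsed seconds.

    Returns None if max_probes requests complete without a 503.
    Connection-level errors (urllib.error.URLError) are not handled
    here and will propagate to the caller.
    """
    start = clock()
    probes = 0
    while max_probes is None or probes < max_probes:
        if get_status(url) == 503:
            return clock() - start
        probes += 1
        sleep(interval)
    return None
```

Running `seconds_until_503` against the route URL of the affected Knative service, with a timestamp noted at start, gives a time window to correlate with the istio-pilot log.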

      Restarting the istio-pilot pod fixes the problem. The relevant part of istio-pilot logs between the moment when the service was still available and when it started to return 503:

      2019-08-13T05:27:03.479786Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "istio-system"]
      2019-08-13T05:27:03.479786Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "istio-system"]
      2019-08-13T05:27:03.480073Z warn istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:353: watch of *v1.Pod ended with: Namespaces Updated
      2019-08-13T05:27:03.480132Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "istio-system"]
      2019-08-13T05:27:03.480305Z warn istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:352: watch of *v1.Service ended with: Namespaces Updated
      2019-08-13T05:27:03.480330Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "istio-system"]
      2019-08-13T05:27:03.480390Z warn istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:360: watch of *v1.Endpoints ended with: Namespaces Updated
      2019-08-13T05:27:04.892061Z error istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:360: Failed to watch *v1.Endpoints: unknown (get endpoints)
      2019-08-13T05:27:05.090648Z error istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:352: Failed to watch *v1.Service: unknown (get services)
      2019-08-13T05:27:05.486557Z info Handling event update for pod autoscale-up-down-up-zzkjgjxz-k9m6z-deployment-6879974c-qs8xd in namespace serving-tests -> 10.131.2.31
      2019-08-13T05:27:05.486603Z info Handling event update for pod autoscale-up-down-up-zzkjgjxz-k9m6z-deployment-6879974c-rpxpg in namespace serving-tests -> 10.128.4.21
      2019-08-13T05:27:05.488367Z error istio.io/istio/pilot/pkg/serviceregistry/kube/controller.go:353: Failed to watch *v1.Pod: unknown (get pods)
      2019-08-13T05:28:12.783748Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "serving-tests-alt" "istio-system"]
      2019-08-13T05:28:12.783793Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "serving-tests-alt" "istio-system"]
      2019-08-13T05:28:12.783829Z info ServiceMeshMemberRoll default updated, namespaces now ["serving-tests" "knative-serving" "knative-eventing" "knative-build" "serving-tests-alt" "istio-system"]
      2019-08-13T05:30:57.286093Z warn istio.io/istio/pkg/kube/secretcontroller/secretcontroller.go:148: watch of *v1.Secret ended with: too old resource version: 255540 (288281)
      2019-08-13T05:31:34.552424Z info ads Push debounce stable[459] 1: 100.162531ms since last change, 100.162531ms since last push, full=true
      2019-08-13T05:31:34.552915Z info ads XDS: Pushing 2019-08-13T05:31:34Z/406 Services: 9, ConnectedEndpoints: 2
      2019-08-13T05:31:34.553474Z info ads Cluster init time 541.857µs 2019-08-13T05:31:34Z/406
      2019-08-13T05:31:34.553581Z info ads Pushing router~10.131.2.9~istio-ingressgateway-bc97545d5-srx97.istio-system~istio-system.svc.cluster.local-53
      2019-08-13T05:31:34.553589Z info ads PushAll done 2019-08-13T05:31:34Z/406 85.984µs
      2019-08-13T05:31:34.553664Z info ads Pushing router~10.128.2.7~cluster-local-gateway-67c8dc578f-mxfrj.istio-system~istio-system.svc.cluster.local-54
      2019-08-13T05:31:34.553946Z info ads CDS: PUSH 2019-08-13T05:31:34Z/406 for router~10.128.2.7~cluster-local-gateway-67c8dc578f-mxfrj.istio-system~istio-system.svc.cluster.local-54 "10.128.2.7:34940", Clusters: 26, Services 9
      2019-08-13T05:31:34.554017Z info ads CDS: PUSH 2019-08-13T05:31:34Z/406 for router~10.131.2.9~istio-ingressgateway-bc97545d5-srx97.istio-system~istio-system.svc.cluster.local-53 "10.131.2.9:44790", Clusters: 50, Services 9
      2019-08-13T05:31:34.555144Z info ads LDS: PUSH for node:cluster-local-gateway-67c8dc578f-mxfrj.istio-system addr:"10.128.2.7:34940" listeners:1 913
      2019-08-13T05:31:34.555162Z info 1 error occurred:
      * gateway omitting listener "0.0.0.0_443" due to: must have more than 0 chains in listener: &v2.Listener{Name:"0.0.0.0_443", Address:core.Address{Address:(*core.Address_SocketAddress)(0xc000bf8f30), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}, FilterChains:[]listener.FilterChain{}, UseOriginalDst:nil, PerConnectionBufferLimitBytes:nil, Metadata:(*core.Metadata)(nil), DeprecatedV1:(*v2.Listener_DeprecatedV1)(nil), DrainType:0, ListenerFilters:[]listener.ListenerFilter(nil), ListenerFiltersTimeout:(*time.Duration)(nil), Transparent:nil, Freebind:nil, SocketOptions:[]*core.SocketOption(nil), TcpFastOpenQueueLength:nil, XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
      2019-08-13T05:31:34.555191Z warn constructed http route config for port 80 with no vhosts; Setting up a default 404 vhost
      2019-08-13T05:31:34.555240Z info ads ADS: RDS: PUSH for node: cluster-local-gateway-67c8dc578f-mxfrj.istio-system addr:10.128.2.7:34940 routes:1
      2019-08-13T05:31:34.555246Z info ads LDS: PUSH for node:istio-ingressgateway-bc97545d5-srx97.istio-system addr:"10.131.2.9:44790" listeners:1 913
      2019-08-13T05:31:34.555534Z info ads ADS: RDS: PUSH for node: istio-ingressgateway-bc97545d5-srx97.istio-system addr:10.131.2.9:44790 routes:1
      2019-08-13T05:31:34.555580Z info ads Push finished: 3.094323ms {
          "ProxyStatus": {},
          "Start": "2019-08-13T05:31:34.552482092Z",
          "End": "2019-08-13T05:31:34.555548866Z"
      }

      Incidentally, the log did not list "myproject" among the ServiceMeshMemberRoll namespaces even though it was defined in the config. This led me to conclude that something was wrong with istio-pilot. Restarting istio-pilot helped, and afterwards the logs showed "myproject" as part of the ServiceMeshMemberRoll.
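The "myproject" observation can be checked mechanically. The sketch below is illustrative only and not part of the original report: it takes the most recent "ServiceMeshMemberRoll ... updated, namespaces now [...]" line from an istio-pilot log and lists the configured members that do not appear in it. The function names are hypothetical. The script cannot tell whether a missing namespace indicates a fault or simply a namespace that does not exist on the cluster.

```python
import re

# Matches the member roll line from the istio-pilot log excerpt above.
# \s+ before "now" tolerates a line break in the middle of the message.
ROLL_LINE = re.compile(
    r"ServiceMeshMemberRoll \S+ updated, namespaces\s+now \[([^\]]*)\]"
)


def last_logged_namespaces(log_text):
    """Return the namespace set from the last member roll log line."""
    matches = ROLL_LINE.findall(log_text)
    if not matches:
        return set()
    return set(re.findall(r'"([^"]+)"', matches[-1]))


def missing_members(configured, log_text):
    """Return configured members absent from the last logged roll, sorted."""
    seen = last_logged_namespaces(log_text)
    return sorted(ns for ns in configured if ns not in seen)
```

For example, passing the `spec.members` list of the ServiceMeshMemberRoll together with the contents of the attached istio-pilot.log would show which members pilot has not picked up.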

      The whole log is attached.
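In the excerpt, each watch that ends with "Namespaces Updated" is followed by a "Failed to watch" error for the same resource kind. To check whether that pattern recurs in the full attached log, a small scan along the following lines may help. This is a sketch with hypothetical function names, not part of the original report. It splits entries on the timestamp-plus-level prefix, so it also works on log text whose entries have been run together without newlines.

```python
import re

# An entry starts with an RFC 3339 timestamp with fractional seconds,
# followed by a log level. Timestamps embedded inside messages
# (e.g. "2019-08-13T05:31:34Z/406") lack that shape and are not split on.
ENTRY_START = re.compile(
    r"(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z (?:info|warn|error)\b)"
)
WATCH_FAILED = re.compile(r"Failed to watch \*v1\.(\w+)")


def split_entries(log_text):
    """Split raw log text into individual entries."""
    return [e.strip() for e in ENTRY_START.split(log_text) if e.strip()]


def failed_watches(log_text):
    """Return (timestamp, resource kind) for every 'Failed to watch' entry."""
    results = []
    for entry in split_entries(log_text):
        match = WATCH_FAILED.search(entry)
        if match:
            timestamp = entry.split(" ", 1)[0]
            results.append((timestamp, match.group(1)))
    return results
```

Splitting on an empty (lookahead) match requires Python 3.7 or later.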

      The istio-ingressgateway logs show only this single line, repeated every 30 minutes (both before and after the moment the service became unavailable), and don't seem to show anything useful:

      [2019-08-13 07:35:19.641][18][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13, 
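The number after "closed:" in this line is a gRPC status code, and in the standard gRPC code table 13 is INTERNAL. The text after the comma, where a status message would normally appear, is empty here. The following sketch (hypothetical helper, not part of the original report) decodes such lines when scanning the full ingress gateway log:

```python
import re

# Standard gRPC status codes.
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

STREAM_CLOSED = re.compile(r"gRPC config stream closed: (\d+),\s*(.*)")


def describe_stream_close(line):
    """Return (code, name, message) for a stream-closed line, else None."""
    match = STREAM_CLOSED.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name = GRPC_STATUS_NAMES.get(code, "unrecognized")
    return code, name, match.group(2).strip()
```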

      The Service Mesh configuration is minimal (sidecar injection, tracing, etc. are disabled):

      apiVersion: maistra.io/v1
      kind: ServiceMeshControlPlane
      metadata:
        name: minimal-multitenant-cni-install
      spec:
        istio:
          global:
            multitenant: true
            proxy:
              # constrain resources for use in smaller environments
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 128Mi
              autoInject: disabled
            omitSidecarInjectorConfigMap: true
            disablePolicyChecks: false    
          istio_cni:
            enabled: true    
          gateways:
            istio-ingressgateway:
              autoscaleEnabled: false
            istio-egressgateway:
              enabled: false
            cluster-local-gateway:
              autoscaleEnabled: false
              enabled: true
              labels:
                app: cluster-local-gateway
                istio: cluster-local-gateway
              ports:
                - name: status-port
                  port: 15020
                - name: http2
                  port: 80
                  targetPort: 80
                - name: https
                  port: 443    
          mixer:
            enabled: false
            policy:
              enabled: false
            telemetry:
              enabled: false    
          pilot:
            # disable autoscaling for use in smaller environments
            autoscaleEnabled: false
            sidecar: false    
          kiali:
            enabled: false   
          tracing:
            enabled: false    
          prometheus:
            enabled: false    
          grafana:
            enabled: false    
          sidecarInjectorWebhook:
            enabled: false
      
      ---
      
      apiVersion: maistra.io/v1
      kind: ServiceMeshMemberRoll
      metadata:
        name: default
      spec:
        members:
        - myproject
        - serving-tests
        - serving-tests-alt
        - knative-serving
        - knative-eventing
        - knative-build
        - test-api-server-source
        - test-broker-channel-flow
        - test-broker-channel-flow-crd-in-memory
        - test-broker-channel-flow-in-memory
        - test-channel-chain
        - test-channel-chain-crd-in-memory
        - test-channel-chain-in-memory
        - test-container-source
        - test-cron-job-source
        - test-default-broker-with-many-triggers
        - test-event-transformation-for-subscription
        - test-event-transformation-for-subscription-crd-in-memory
        - test-event-transformation-for-subscription-in-memory
        - test-event-transformation-for-trigger
        - test-event-transformation-for-trigger-crd-in-memory
        - test-event-transformation-for-trigger-in-memory
        - test-single-binary-event-for-channel
        - test-single-binary-event-for-channel-crd-in-memory
        - test-single-binary-event-for-channel-in-memory
        - test-single-structured-event-for-channel
        - test-single-structured-event-for-channel-crd-in-memory
        - test-single-structured-event-for-channel-in-memory

          Attachments

          1. istio-ingressgateway.log
            4.52 MB
          2. istio-pilot.log
            244 kB
          3. istio-pilot.logs
            2.65 MB
          4. MAISTRA-833-maschmid-503-unknown-cluster.tar.gz
            801 kB
          5. servicemeshcontrolplane.yaml
            8 kB

                People

                • Assignee:
                  dgrimm Daniel Grimm
                  Reporter:
                  afield Alan Field
                • Votes:
                  0
                  Watchers:
                  8