OSSM-2006: Istiod controllers start before caches are synced


    Sprint: Sprint 57, Sprint 58 - week 1

      xns-informers has a critical design flaw: its informers return true from informer.HasSynced() before the underlying per-namespace informers have actually synced. As a result, the controllers in Istiod start running prematurely and operate on an empty cache. Each controller therefore sees the wrong state and reconfigures all the proxies with invalid configs.
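
      For context, Istio's Kubernetes controllers follow the standard client-go pattern of gating their work loops on the informer's HasSynced() function. Below is a minimal, self-contained sketch of that pattern (plain client-go, not Istiod's or xns-informers' actual code) to show why a HasSynced() that returns true too early defeats the guard:

      package main

      import (
          "fmt"
          "time"

          "k8s.io/apimachinery/pkg/labels"
          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/tools/cache"
      )

      func main() {
          // Connect to the cluster (in-cluster config, for illustration only).
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Standard shared informer factory watching Services.
          factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
          svcInformer := factory.Core().V1().Services()

          stop := make(chan struct{})
          defer close(stop)
          factory.Start(stop)

          // Controllers gate their work loop on HasSynced(). This guard only
          // works if HasSynced() returns true strictly after the cache has been
          // populated; a HasSynced() that returns true while the per-namespace
          // informers don't exist yet lets the controller proceed immediately.
          if !cache.WaitForCacheSync(stop, svcInformer.Informer().HasSynced) {
              panic("caches never synced")
          }

          // With a lying HasSynced(), this lister is still empty at this point,
          // and a controller acting on it would push configs dropping every service.
          svcs, err := svcInformer.Lister().List(labels.Everything())
          if err != nil {
              panic(err)
          }
          fmt.Printf("services visible at startup: %d\n", len(svcs))
      }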

      For example, imagine the cluster contains several VirtualServices. During steady-state operation of istiod, the proxies are configured correctly. However, when istiod restarts and the controllers start before the cache is synced, they find no VirtualServices and push an invalid config to the proxies, removing the virtual services. Soon afterwards the cache finishes syncing, the controllers see the VirtualServices again, and they update the proxy config once more.

      The following log shows a similar example, where istiod thinks there are no endpoints for bookinfo services (of course, those services actually do have endpoints):

      ... info    kube    controller "networking.istio.io/v1alpha3/VirtualService" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    kube    controller "networking.istio.io/v1alpha3/VirtualService" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    ads     Incremental push, service reviews.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    ads     Incremental push, service details.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    ads     Incremental push, service ratings.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    ads     Incremental push, service productpage.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    kube    controller "security.istio.io/v1beta1/PeerAuthentication" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    ads     Incremental push, service istio-egressgateway.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    ads     Incremental push, service istiod-minimal.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    ads     Incremental push, service istio-ingressgateway.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
      ... info    smmr    Cache synced for listener "ior"
      ... info    klog    Waiting for caches to sync for Gateways
      ... info    klog    Caches are synced for Gateways 
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    kube    controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing...
      ... info    Work item handle failed (waiting till full synchronization), retry after delay 1s
      ... info    ads     Full push, new service istio-system/istio-ingressgateway.istio-system.svc.cluster.local
      ... info    ads     Full push, new service istio-system/istio-egressgateway.istio-system.svc.cluster.local
      ... info    ads     Full push, new service bookinfo/reviews.bookinfo.svc.cluster.local
      ... info    ads     Full push, new service bookinfo/ratings.bookinfo.svc.cluster.local
      ... info    ads     Full push, new service bookinfo/productpage.bookinfo.svc.cluster.local
      ... info    ads     Full push, new service bookinfo/details.bookinfo.svc.cluster.local
      ... info    ads     Push debounce stable[1] 66 for config ServiceEntry/bookinfo/ratings.bookinfo.svc.cluster.local and 50 more configs: 100.497417ms since last change, 167.057ms since last push, full=true
      ... info    ads     XDS: Pushing:2022-09-16T09:48:10+02:00/1 Services:7 ConnectedEndpoints:0 Version:2022-09-16T09:48:10+02:00/1
      

      OSSM-1949 is a direct consequence of this bug: the ior controller runs before the gateway cache is populated, sees no gateways, and therefore deletes all routes.

      I've also seen many other strange warnings/errors during startup, such as istiod not finding a certain ConfigMap or Secret. All of these issues appear to have the same underlying cause.

      The design flaw in xns-informers is that, during startup, the multi-namespace informer's HasSynced() function returns true before the memberroll controller within istiod has initialized the list of member namespaces in the informer factory, simply because the list of per-namespace informers is still empty at that point (see code: [1]). The multi-namespace informer should report being synced only after the namespace list (and thus the set of underlying per-namespace informers) has been initialized and all of those informers have synced.

      [1] https://github.com/maistra/xns-informer/blob/maistra-2.3/pkg/informers/informer.go#L264-L275
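
      To make the flaw concrete, here is a minimal sketch of the aggregation pattern and one possible fix. This is not the actual xns-informer code (that is linked at [1]); the type, the namespacesInitialized field, and the method names are hypothetical and only illustrate the idea that aggregating HasSynced() over an empty set of per-namespace informers trivially yields true:

      // Sketch of the aggregation flaw; not the actual xns-informer code.
      package multinsinformer

      import (
          "sync"

          "k8s.io/client-go/tools/cache"
      )

      type multiNamespaceInformer struct {
          mu        sync.RWMutex
          informers map[string]cache.SharedIndexInformer // one informer per member namespace

          // namespacesInitialized is a hypothetical flag illustrating one possible
          // fix: it would be set once the member-namespace list has been delivered
          // (e.g. by the memberroll controller).
          namespacesInitialized bool
      }

      // Flawed behavior: while the map is empty (namespaces not yet known), the
      // loop body never executes and the function falls through to "return true",
      // so callers like cache.WaitForCacheSync() unblock immediately.
      func (i *multiNamespaceInformer) hasSyncedFlawed() bool {
          i.mu.RLock()
          defer i.mu.RUnlock()
          for _, inf := range i.informers {
              if !inf.HasSynced() {
                  return false
              }
          }
          return true // true even when i.informers is still empty
      }

      // Possible fix: refuse to report "synced" until the namespace list has been
      // initialized at least once, and only then require every per-namespace
      // informer to be synced.
      func (i *multiNamespaceInformer) hasSyncedFixed() bool {
          i.mu.RLock()
          defer i.mu.RUnlock()
          if !i.namespacesInitialized {
              return false
          }
          for _, inf := range i.informers {
              if !inf.HasSynced() {
                  return false
              }
          }
          return true
      }

      The key point of the sketch is that "no per-namespace informers yet" and "all per-namespace informers synced" must be distinguishable states; today the former is reported as the latter.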

            Assignee/Reporter: Marko Luksa (mluksa@redhat.com)