-
Bug
-
Resolution: Done
-
Blocker
-
OSSM 2.3.0
-
False
-
None
-
False
-
Release Notes
-
Done
-
Sprint 57, Sprint 58 - week 1
Xns-informers has a critical design flaw that causes its informers to return true on informer.HasSynced() calls before the underlying per-namespace informers have synced. This causes the controllers in Istiod to run prematurely, causing them to operate on an empty cache. The controller thus sees the wrong state and reconfigures all the proxies with invalid configs.
For example, imagine the cluster contains several VirtualServices. During steady-state operation of istiod, the proxies are configured correctly. However, when istiod restarts, if the controllers start before the cache is synced, they find no VirtualServices, and send out an invalid config to the proxies (removing the virtual services). Soon afterwards, the cache is synced and the controllers now see the VirtualServices and update the proxy config again.
The following log shows a similar example, where istiod thinks there are no endpoints for bookinfo services (of course, those services actually do have endpoints):
... info kube controller "networking.istio.io/v1alpha3/VirtualService" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info kube controller "networking.istio.io/v1alpha3/VirtualService" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info ads Incremental push, service reviews.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info ads Incremental push, service details.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info ads Incremental push, service ratings.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info ads Incremental push, service productpage.bookinfo.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info kube controller "security.istio.io/v1beta1/PeerAuthentication" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info ads Incremental push, service istio-egressgateway.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info ads Incremental push, service istiod-minimal.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info ads Incremental push, service istio-ingressgateway.istio-system.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints ... info smmr Cache synced for listener "ior" ... info klog Waiting for caches to sync for Gateways ... info klog Caches are synced for Gateways ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info kube controller "networking.istio.io/v1alpha3/EnvoyFilter" is syncing... ... info Work item handle failed (waiting till full synchronization), retry after delay 1s ... info ads Full push, new service istio-system/istio-ingressgateway.istio-system.svc.cluster.local ... info ads Full push, new service istio-system/istio-egressgateway.istio-system.svc.cluster.local ... info ads Full push, new service bookinfo/reviews.bookinfo.svc.cluster.local ... info ads Full push, new service bookinfo/ratings.bookinfo.svc.cluster.local ... info ads Full push, new service bookinfo/productpage.bookinfo.svc.cluster.local ... info ads Full push, new service bookinfo/details.bookinfo.svc.cluster.local ... info ads Push debounce stable[1] 66 for config ServiceEntry/bookinfo/ratings.bookinfo.svc.cluster.local and 50 more configs: 100.497417ms since last change, 167.057ms since last push, full=true ... info ads XDS: Pushing:2022-09-16T09:48:10+02:00/1 Services:7 ConnectedEndpoints:0 Version:2022-09-16T09:48:10+02:00/1
OSSM-1949 is directly caused by this bug, because the ior controller runs before the gateway cache is populated, causing it to delete all routes because it thinks there are no gateways.
I've also seen many other strange warnings/errors during startup, such as istiod not finding some configmap or secret. It appears all these issues have the same underlying cause.
The design flaw in xns-informers is that during startup, before the memberroll controller within istiod initializes the list of member namespaces in the informer factory, the multi-namespace informer's HasSynced() function returns true, because the list of per-namespace informers is still empty (see code: [1]). It's clear that the multi-namespace informer should report being synced only after the list of namespaces (and thus the underlying per-namespace informer list) is initialized and all the informers are synced.
[1] https://github.com/maistra/xns-informer/blob/maistra-2.3/pkg/informers/informer.go#L264-L275
- is cloned by
-
OSSM-2068 Istiod controllers start before caches are synced [maistra 2.1, 2.2]
- Closed
- mentioned on