Type: Bug
Resolution: Done
Priority: Major
Fix Version: maistra-1.0.0
Originally reported by maschmid@redhat.com as a comment on MAISTRA-833:
On my cluster I am reproducing a similar problem: I also get a 503 from the ingressgateway, and in my case the ingressgateway logs contain "unknown cluster outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local":
[2019-08-28 11:53:39.001][27][debug][router] [external/envoy/source/common/router/router.cc:308] [C9047][S6470669494928115903] unknown cluster 'outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local'
[2019-08-28 11:53:39.001][27][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-08-28 11:53:39.001][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1322] [C9047][S6470669494928115903] encoding headers via codec (end_stream=true):
':status', '503'
'date', 'Wed, 28 Aug 2019 11:53:38 GMT'
'server', 'istio-envoy'
Note that autoscale-go-ljzb2 indeed no longer exists; the service is currently named autoscale-go-2zjgp:
oc get services -n myproject
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP                                           PORT(S)             AGE
autoscale-go               ExternalName   <none>           istio-ingressgateway.istio-system.svc.cluster.local   <none>              21m
autoscale-go-2zjgp         ClusterIP      172.30.252.247   <none>                                                80/TCP              22m
autoscale-go-2zjgp-dpghl   ClusterIP      172.30.110.171   <none>                                                9090/TCP,9091/TCP   22m
autoscale-go-2zjgp-hv68c   ClusterIP      172.30.23.93     <none>
Grepping the ingressgateway config dump shows that the config contains both the new and some old version [...]
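As an aside, a minimal sketch of how such a stale reference can be checked programmatically, assuming the Envoy admin port (15000) of the istio-ingressgateway pod has been port-forwarded to localhost, and using the cluster name from the log above:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Cluster name taken from the gateway debug log above.
	stale := "outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local"

	// Assumes something like: oc port-forward -n istio-system <ingressgateway pod> 15000:15000
	resp, err := http.Get("http://localhost:15000/config_dump")
	if err != nil {
		log.Fatalf("fetching config_dump: %v", err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading config_dump: %v", err)
	}

	if strings.Contains(string(body), stale) {
		fmt.Println("config_dump still references the deleted service:", stale)
	} else {
		fmt.Println("no reference to", stale, "in config_dump")
	}
}

The same check can be done with a plain curl against /config_dump; the point is only that the deleted service name still appears in the gateway's configuration.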
Marek's logs are attached to this issue. I checked the operator logs on the cluster itself; there was nothing out of the ordinary.
I investigated a bit and found the following additional information:
- Istio configuration was accurate
- the route information in the envoy config was outdated and still pointing to the old, non-existent service
- cluster information was up-to-date
- pilot was still pushing config updates
- Galley reported far fewer "Underlying Result Channel closed" messages (one every few seconds before the incident, one every few minutes afterwards)
- Just prior to the error, two namespaces were deleted in quick succession
- a test VirtualService I created never ended up in the envoy configuration (see the probe sketch after this list)
- restarting Galley fixed the issue
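This is not the exact probe I used, but a minimal sketch of one way to run such a test with client-go's dynamic client: create a throwaway VirtualService and then check whether it ever shows up in the gateway's config_dump. The resource name, host and destination are placeholders, and it assumes the Istio networking.istio.io/v1alpha3 CRDs plus a reachable kubeconfig:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	vsGVR := schema.GroupVersionResource{
		Group:    "networking.istio.io",
		Version:  "v1alpha3",
		Resource: "virtualservices",
	}

	// Throwaway VirtualService; name, host and destination are placeholders.
	probe := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "networking.istio.io/v1alpha3",
		"kind":       "VirtualService",
		"metadata": map[string]interface{}{
			"name":      "propagation-probe",
			"namespace": "myproject",
		},
		"spec": map[string]interface{}{
			"hosts": []interface{}{"probe.example.com"},
			"http": []interface{}{
				map[string]interface{}{
					"route": []interface{}{
						map[string]interface{}{
							"destination": map[string]interface{}{
								"host": "autoscale-go-2zjgp.myproject.svc.cluster.local",
							},
						},
					},
				},
			},
		},
	}}

	if _, err := client.Resource(vsGVR).Namespace("myproject").Create(
		context.TODO(), probe, metav1.CreateOptions{}); err != nil {
		log.Fatalf("creating probe VirtualService: %v", err)
	}
	log.Println("probe created; grep the gateway config_dump for probe.example.com")
}

If config distribution is healthy, the probe host shows up in the gateway configuration shortly after creation; in the broken state described above it never did.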
Current theory:
- Just prior to the error, two namespaces were deleted in quick succession
- Galley started re-creating watches after the first deletion had propagated
- As the second namespace was already terminating, Galley failed to create a number of watches
- Apparently, it was never able to recover/recreate the missing watches
- As a consequence, Pilot would never receive resource updates for Istio objects
- The ingress-gateway would never receive proper route information. The clusters were created, though, as they are built from Pilot's own Kubernetes watches
I'm currently trying to reproduce this to find out if this can be traced to e.g. the galley/pkg/source/kube or MultiNamespaceListerWatcher code
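To make the theory concrete, below is a simplified client-go sketch of the suspected failure mode; it is not the actual galley/pkg/source/kube or MultiNamespaceListerWatcher code, and the namespace names are placeholders. One watch is opened per namespace; a namespace whose watch could not be created (for example because it was already terminating) ends up permanently unwatched unless something retries it.

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// startWatches opens a ConfigMap watch in every given namespace and reports
// the namespaces where watch creation failed.
func startWatches(client kubernetes.Interface, namespaces []string) (map[string]watch.Interface, []string) {
	watches := make(map[string]watch.Interface)
	var failed []string
	for _, ns := range namespaces {
		w, err := client.CoreV1().ConfigMaps(ns).Watch(context.TODO(), metav1.ListOptions{})
		if err != nil {
			// If nothing ever retries this namespace (or drops it once it is
			// really gone), it stays unwatched forever - the behaviour the
			// theory above suspects in Galley.
			log.Printf("could not create watch in %q: %v", ns, err)
			failed = append(failed, ns)
			continue
		}
		watches[ns] = w
	}
	return watches, failed
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder namespace list for the sketch.
	namespaces := []string{"myproject", "knative-serving"}
	watches, failed := startWatches(client, namespaces)
	log.Printf("watching %d namespaces, failed to watch %d", len(watches), len(failed))
}

If Galley behaves like this sketch without any retry of the failed namespaces, Pilot would stop receiving updates for the affected Istio resources, which would match the observation that restarting Galley fixed the issue.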
Relates to: MAISTRA-833 "istio-pilot pod has to be restarted periodically to keep Knative services functional" (Closed)