Details

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: maistra-1.0.2
Affects Version/s: maistra-1.0.0
Component/s: None
Labels:
None

Git Pull Request:
https://github.com/Maistra/istio/pull/49

Sprint:
MAISTRA 1.0.2

SFDC Cases Counter:
SFDC Cases Links:

Description

Originally reported by maschmid@redhat.com as a comment on ~~MAISTRA-833~~:

On my cluster I am reproducing a similar problem.

I also get a 503 from the ingressgateway; in my case the ingressgateway logs contain "unknown cluster outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local":

[2019-08-28 11:53:39.001][27][debug][router] [external/envoy/source/common/router/router.cc:308] [C9047][S6470669494928115903] unknown cluster 'outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local'
[2019-08-28 11:53:39.001][27][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-08-28 11:53:39.001][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1322] [C9047][S6470669494928115903] encoding headers via codec (end_stream=true):
':status', '503'
'date', 'Wed, 28 Aug 2019 11:53:38 GMT'
'server', 'istio-envoy'

Note that autoscale-go-ljzb2 no longer exists. The service is currently named autoscale-go-2zjgp:

 oc get services -n myproject
NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP                                           PORT(S)             AGE
autoscale-go               ExternalName   <none>           istio-ingressgateway.istio-system.svc.cluster.local   <none>              21m
autoscale-go-2zjgp         ClusterIP      172.30.252.247   <none>                                                80/TCP              22m
autoscale-go-2zjgp-dpghl   ClusterIP      172.30.110.171   <none>                                                9090/TCP,9091/TCP   22m
autoscale-go-2zjgp-hv68c   ClusterIP      172.30.23.93     <none>

Grepping the ingressgateway config dump shows that the config contains both the new and some old version [...]

Marek's logs are attached to this issue. I checked the operator logs on the cluster itself; there was nothing out of the ordinary.

I investigated a bit and found the following additional information:

The Istio configuration was accurate
The route information in the Envoy config was outdated and still pointed to the old, non-existent service
The cluster information was up to date
Pilot was still pushing config updates
Galley reported far fewer "Underlying Result Channel closed" messages (one every few seconds before the incident, one every few minutes afterwards)
Just prior to the error, two namespaces were deleted in quick succession
A test VirtualService I created never ended up in the Envoy configuration
Restarting Galley fixed the issue
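
For anyone checking whether a gateway is in the same state, the mismatch described above (routes pointing at a cluster that Envoy no longer knows) can be expressed as a simple set difference. The following is only a sketch: it assumes the cluster names referenced by routes and the cluster names known to Envoy have already been extracted from the config dump by hand, and the function name staleRouteTargets is hypothetical. The second cluster name in main is illustrative and follows the naming pattern seen in the log line.

```go
package main

import (
	"fmt"
	"sort"
)

// staleRouteTargets returns, sorted and de-duplicated, the cluster names that
// routes refer to but that are absent from the list of known clusters.
// Both inputs are assumed to have been pulled out of a config dump beforehand.
func staleRouteTargets(routeClusters, knownClusters []string) []string {
	known := make(map[string]struct{}, len(knownClusters))
	for _, c := range knownClusters {
		known[c] = struct{}{}
	}

	seen := make(map[string]struct{})
	var stale []string
	for _, c := range routeClusters {
		if _, ok := known[c]; ok {
			continue // route target exists as a cluster
		}
		if _, dup := seen[c]; dup {
			continue // already reported
		}
		seen[c] = struct{}{}
		stale = append(stale, c)
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// The first name is taken from the ingressgateway log; the second is an
	// illustrative name built from the current service name.
	routes := []string{
		"outbound|80||autoscale-go-ljzb2.myproject.svc.cluster.local",
		"outbound|80||autoscale-go-2zjgp.myproject.svc.cluster.local",
	}
	clusters := []string{
		"outbound|80||autoscale-go-2zjgp.myproject.svc.cluster.local",
	}

	for _, c := range staleRouteTargets(routes, clusters) {
		fmt.Println("route references unknown cluster:", c)
	}
}
```

A non-empty result while the cluster list itself is current would match the symptoms listed above.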

Current theory:

Just prior to the error, two namespaces were deleted in quick succession
Galley started re-creating its watches after the first deletion had propagated
As the second namespace was already being terminated, Galley failed to create a number of watches
Apparently, it was never able to recover and recreate the missing watches
As a consequence, Pilot would never receive resource updates for Istio objects
The ingress gateway would therefore never receive up-to-date route information. The clusters were still created, though, as they are built from Pilot's own Kubernetes watches

I'm currently trying to reproduce this to find out whether it can be traced to, for example, the galley/pkg/source/kube or MultiNamespaceListerWatcher code.

Attachments

- envoy_config_dump.json (128 kB, 2019/08/28 12:35 PM)
- istio-citadel-56879889cf-87d6r.logs (182 kB, 2019/08/28 11:02 AM)
- istio-galley-696b7f8f88-4fndg.logs (2.10 MB, 2019/08/28 11:02 AM)
- istio-ingressgateway-56d5f5cddd-c8qmb.config_dump (121 kB, 2019/08/28 11:02 AM)
- istio-ingressgateway-56d5f5cddd-c8qmb.logs (4.04 MB, 2019/08/28 11:02 AM)
- istio-pilot-b48d9d654-fnr5k.discovery.logs (5.18 MB, 2019/08/28 11:02 AM)

Issue Links

relates to

MAISTRA-833 istio-pilot pod has to be restarted periodically to keep Knative services functional

Closed

People

Assignee: Daniel Grimm

Reporter: Daniel Grimm

Votes: 0

Watchers: 5

Dates

Created: 2019/08/28 11:03 AM

Updated: 2021/10/24 6:21 AM

Resolved: 2019/10/23 9:30 AM