OpenShift Service Mesh / OSSM-3389

[OSSM] local federation gateway IP polluting federated service endpoints


    • Type: Ticket
    • Resolution: Done
    • Priority: Minor
    • Fix Version/s: OSSM 2.3.3, OSSM 2.4.0
    • Affects Version/s: OSSM 2.3.1
    • Component/s: Maistra

      What problem/issue/behavior are you having trouble with? What do you expect to see?
      We run the OSSM operator 2.3.1 on OpenShift 4.12.

      We created a basic federation hello-world scenario very much like the one described in the docs (https://docs.openshift.com/container-platform/4.12/service_mesh/v2x/ossm-federation.html): red-mesh exporting a service to green-mesh via an ExportedServiceSet/ImportedServiceSet pair. When trying to reach the red service from a green pod, however, we experienced upstream connectivity issues on roughly 50% of the queries. Digging a bit further, we found that the IP address of the red-federation ingress LoadBalancer service (exposed on the green OpenShift cluster) somehow ends up among the target endpoints of the imported service in the Envoy config on the red-federation egress, which would explain why roughly half of the requests fail:

      $ kubectl exec -ti egress-red-6d8f58bbcf-2wmr5 -n green-mesh-control-plane --  curl -s http://localhost:15000/clusters | grep exports.local | grep cx_active
      outbound|80||svc-istio-test.red-workload.svc.green-exports.local::10.56.142.126:15443::cx_active::0  <-- IP of the green-federation ingress LB service exposed on the red cluster. This one is expected.
      outbound|80||svc-istio-test.red-workload.svc.green-exports.local::10.56.142.131:15443::cx_active::0  <-- IP of the red-federation ingress LB service exposed on the green cluster. *This one is not expected.*
      

      We suspect this comes from the fact that the federation service discovery query on the red istiod returns both the red and the green gateway IPs:

       
      $ kubectl exec -ti istiod-red-749cd9cff9-xndtq -n red-mesh-control-plane -- curl --insecure https://localhost:8188/v1/services/green
      "networkGatewayEndpoints":[{"port":15443,"hostname":"10.56.142.131"},
      {"port":15443,"hostname":"10.56.142.126"}],
      

      When we tried to enable federation:debug logging to investigate this further, we ended up triggering a segfault in istiod:

      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2a3adc1]

      goroutine 189 [running]:
      istio.io/istio/pkg/servicemesh/federation/server.(*meshServer).serviceUpdated(0xc001ee6000, 0xc002381540, 0x1)
              /remote-source/istio/app/pkg/servicemesh/federation/server/server.go:629

      This is probably caused by svcMessage being nil here: https://github.com/maistra/istio/blob/maistra-2.3/pkg/servicemesh/federation/server/server.go#L628-L630
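      For illustration only, here is a minimal, self-contained Go sketch of the nil guard we have in mind. The serviceMessage / meshServer / exports names are hypothetical stand-ins, not the actual maistra/istio types:

      package main

      import "fmt"

      // Hypothetical stand-ins for the federation server types; the real
      // maistra/istio structs differ. This only illustrates the nil guard
      // that would avoid the SIGSEGV shown above.
      type serviceMessage struct {
          name string
      }

      type meshServer struct {
          // Exported services keyed by hostname; a lookup miss yields nil.
          exports map[string]*serviceMessage
      }

      // serviceUpdated sketches the suspected failure mode: dereferencing the
      // lookup result without checking for nil panics whenever the updated
      // service is not (yet) present in the export map.
      func (s *meshServer) serviceUpdated(hostname string) {
          svcMessage := s.exports[hostname] // may be nil
          if svcMessage == nil {
              // Guard that would prevent the panic: nothing is exported under
              // this hostname, so there is nothing to push to the peer mesh.
              return
          }
          fmt.Printf("pushing update for exported service %s\n", svcMessage.name)
      }

      func main() {
          s := &meshServer{exports: map[string]*serviceMessage{}}
          // Without the nil check this call would dereference a nil pointer.
          s.serviceUpdated("svc-istio-test.red-workload.svc.cluster.local")
      }

      With a guard like this, an update for a service that is not in the export map would simply be skipped instead of crashing istiod.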

      The curious thing is that in this situation istiod crashes once, and after it restarts it no longer returns any networkGatewayEndpoints in the service discovery response. As a consequence, only the green-federation ingress endpoint is listed in the Envoy config on the red-federation egress:

      $ kubectl exec -ti egress-red-6d8f58bbcf-2wmr5 -n green-mesh-control-plane --  curl -s http://localhost:15000/clusters | grep exports.local | grep cx_active
      outbound|80||svc-istio-test.red-workload.svc.green-exports.local::10.56.142.126:15443::cx_active::0
       

      Is this a bug or expected behaviour?

      What is the business impact? Please also provide timeframe information.
      We are experimenting with OSSM with the intention of potentially using it to power new use cases in production soon.

      Note that this is also captured in support case https://access.redhat.com/support/cases/#/case/03436386
      and in this doc: https://docs.google.com/document/d/1673j_r6V1XXS-TNRsviCO9D_CT4nZQOmJRPz6h-TN8Q/edit

              Assignee: bmangoen Brian Mangoenpawiro
              Reporter: rhn-support-sappleton Shaun Appleton
              Votes: 0
              Watchers: 3
