OpenShift Service Mesh / OSSM-6381

Customer was facing issues with Istio routing after ArgoCD resync


    • Type: Ticket
    • Resolution: Unresolved
    • Priority: Normal
    • Customer Impact, Maistra

      Hello team,

      We have a customer who was facing issues with the Service Mesh control plane. They were unable to reach their services through the ingress gateway.

      They told us that their cluster had been disconnected from ArgoCD, that the Service Mesh Operator was upgraded from 2.2.0 to 2.2.1 on April 22nd, and that they reconnected ArgoCD to the cluster on April 24th, which caused a production outage.

      The ingress gateway logs were showing these errors:

      2024-04-24T13:10:53.204121218Z [2024-04-24T13:10:22.748Z] "GET / HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 30025 - "194.53.160.253,10.131.4.240" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" "095d9070-9f11-49c8-bbe0-88b18e0e1296" "kv-aanvraag-ops-prd.web.liander.nl" "10.128.6.22:8080" outbound|8080||kv-aanvraag-ops.reboot-prd.svc.cluster.local - 10.130.4.249:8080 10.131.4.240:55352 - -
      2024-04-24T13:10:56.203492891Z [2024-04-24T13:10:25.922Z] "GET /favicon.ico HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 30055 - "194.53.161.150,10.131.1.27" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0" "59234875-e420-4628-9c98-6bfdea2ed0a4" "kvi-prd.web.liander.nl" "10.131.4.62:8080" outbound|8080||kvi.reboot-prd.svc.cluster.local - 10.130.4.249:8080 10.131.1.27:53164 - -
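For anyone triaging similar entries, here is a minimal Python sketch that pulls the relevant fields out of one of the lines above. It assumes the gateway uses the default Istio access log format (which these lines appear to match). The flag descriptions are paraphrased from the Envoy access log documentation. The names `parse_access_log` and `FLAG_MEANINGS` are illustrative only and not part of any tool.

```python
import re

# Envoy response flags seen in this ticket (paraphrased from Envoy docs).
FLAG_MEANINGS = {
    "UF": "upstream connection failure",
    "URX": "upstream retry limit (HTTP) or max connect attempts (TCP) reached",
}

# Assumes the default Istio access log field order.
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<flags>\S+) (?P<details>\S+) \S+ "[^"]*" '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) \S+ '
    r'"(?P<xff>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<request_id>[^"]*)" '
    r'"(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)" (?P<upstream_cluster>\S+)'
)

# First log line from the ticket, starting at the Envoy timestamp.
SAMPLE = (
    '[2024-04-24T13:10:22.748Z] "GET / HTTP/1.1" 503 UF,URX '
    'upstream_reset_before_response_started{connection_failure} - "-" 0 91 30025 - '
    '"194.53.160.253,10.131.4.240" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" '
    '"095d9070-9f11-49c8-bbe0-88b18e0e1296" "kv-aanvraag-ops-prd.web.liander.nl" '
    '"10.128.6.22:8080" outbound|8080||kv-aanvraag-ops.reboot-prd.svc.cluster.local '
    '- 10.130.4.249:8080 10.131.4.240:55352 - -'
)


def parse_access_log(line):
    """Return selected fields of one access log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["duration_ms"] = int(entry["duration_ms"])
    # "-" means no flags were set; otherwise flags are comma-separated.
    entry["flags"] = [] if entry["flags"] == "-" else entry["flags"].split(",")
    return entry


if __name__ == "__main__":
    entry = parse_access_log(SAMPLE)
    print(entry["status"], entry["duration_ms"], "ms", entry["upstream_cluster"])
    for flag in entry["flags"]:
        print(flag, "=", FLAG_MEANINGS.get(flag, "see Envoy documentation"))
```

Read this way, both entries show the gateway failing to open a connection to the upstream pod (UF), exhausting its retries (URX), and returning a 503 after roughly 30 seconds.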

      We helped them recreate the control plane, following these steps:

       

      • The old control plane (reboot-ossm-prd) stopped working, causing "Istio routing" issues.
      • To test whether this was a pod-level problem, we deleted the istiod and istio-ingressgateway pods, but that did not help.
      • Using istioctl, we checked the Envoy configuration (listeners, routes, clusters, endpoints) and everything seemed to be in order.
      • We then created a "temporary" control plane with the same SMCP and onboarded the members; once we did that, routing worked.
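The Envoy checks mentioned in the steps above can be repeated with something like the following sketch, which only builds the `istioctl proxy-config` command lines for the four configuration sections. The pod name used here is a placeholder (the real gateway pod name is not in this ticket), and the helper `proxy_config_commands` is hypothetical. The exact commands run during the incident were not recorded.

```python
import shlex

# Envoy configuration sections that were inspected during troubleshooting.
SECTIONS = ("listener", "route", "cluster", "endpoint")


def proxy_config_commands(pod, namespace):
    """Build one `istioctl proxy-config <section> <pod>.<namespace>` argv per section."""
    target = f"{pod}.{namespace}"
    return [["istioctl", "proxy-config", section, target] for section in SECTIONS]


if __name__ == "__main__":
    # Placeholder pod name; replace it with the actual ingress gateway pod.
    for argv in proxy_config_commands("istio-ingressgateway-example", "reboot-ossm-prd"):
        print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` on a host where `istioctl` is installed and logged in to the cluster.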

      The issue is now resolved, but the customer is asking for an RCA to understand the cause and what they should do to avoid a production outage in the future. They are open to disconnecting ArgoCD from the Service Mesh configuration or blocking automatic updates of the Service Mesh Operator.

      We collected a must-gather during the issue for this review. Examples of these errors can be found in these namespaces:

      reboot-ossm-prd
      reboot-prd

       

      Please let us know if you need any additional information or if we can help further with the investigation.

      Thank you for your cooperation and best regards.

      must-gather.aa
      must-gather.ab

        1. must-gather.aa
          200.00 MB
        2. must-gather.ab
          180.54 MB

            _bmangoen Brian Mangoenpawiro
            rhn-support-hmorella Hugo Morella Soler
