Type: Ticket
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: Customer Impact, Maistra
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:

Hello team,

We have a customer who was facing issues with the Service Mesh control plane: they were unable to reach their services through the ingress gateway.

They told us that their cluster had been disconnected from ArgoCD, that the Service Mesh Operator was upgraded from 2.2.0 to 2.2.1 on April 22nd, and that they reconnected ArgoCD to the cluster on April 24th, causing a production outage.

The ingress gateway logs showed these errors:

2024-04-24T13:10:53.204121218Z [2024-04-24T13:10:22.748Z] "GET / HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 30025 - "194.53.160.253,10.131.4.240" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" "095d9070-9f11-49c8-bbe0-88b18e0e1296" "kv-aanvraag-ops-prd.web.liander.nl" "10.128.6.22:8080" outbound|8080||kv-aanvraag-ops.reboot-prd.svc.cluster.local - 10.130.4.249:8080 10.131.4.240:55352 - -
2024-04-24T13:10:56.203492891Z [2024-04-24T13:10:25.922Z] "GET /favicon.ico HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 30055 - "194.53.161.150,10.131.1.27" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0" "59234875-e420-4628-9c98-6bfdea2ed0a4" "kvi-prd.web.liander.nl" "10.131.4.62:8080" outbound|8080||kvi.reboot-prd.svc.cluster.local - 10.130.4.249:8080 10.131.1.27:53164 - -
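
To make log lines like the two above easier to compare, here is a minimal Python sketch that extracts the main fields and counts 503 responses per upstream cluster. It assumes the gateway uses Istio's default Envoy access log format and that each line is preceded by a container log timestamp, as in the samples; the function names `parse_line` and `failures_by_cluster` are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Assumption: Istio's default Envoy access log field order. The leading
# container log timestamp is skipped because the pattern is searched, not
# anchored at the start of the line.
ACCESS_LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<flags>\S+) (?P<details>\S+) '
    r'\S+ "[^"]*" '  # connection termination details, transport failure reason
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) \S+ '
    r'"(?P<forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" '
    r'"(?P<upstream_host>[^"]*)" (?P<upstream_cluster>\S+)'
)

# Meanings of the two Envoy response flags that appear in this ticket.
FLAG_MEANINGS = {
    "UF": "upstream connection failure",
    "URX": "upstream retry limit (HTTP) or max connect attempts (TCP) reached",
}


def parse_line(line):
    """Return a dict of selected fields, or None if the line does not match."""
    match = ACCESS_LOG_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["duration_ms"] = int(record["duration_ms"])
    record["flags"] = record["flags"].split(",")
    return record


def failures_by_cluster(lines):
    """Count 503 responses per upstream cluster."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["status"] == 503:
            counts[record["upstream_cluster"]] += 1
    return counts


# First log line from this ticket, copied verbatim.
SAMPLE = (
    '2024-04-24T13:10:53.204121218Z [2024-04-24T13:10:22.748Z] '
    '"GET / HTTP/1.1" 503 UF,URX '
    'upstream_reset_before_response_started{connection_failure} - "-" '
    '0 91 30025 - "194.53.160.253,10.131.4.240" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" '
    '"095d9070-9f11-49c8-bbe0-88b18e0e1296" '
    '"kv-aanvraag-ops-prd.web.liander.nl" "10.128.6.22:8080" '
    'outbound|8080||kv-aanvraag-ops.reboot-prd.svc.cluster.local - '
    '10.130.4.249:8080 10.131.4.240:55352 - -'
)

if __name__ == "__main__":
    record = parse_line(SAMPLE)
    for flag in record["flags"]:
        print(flag, "=", FLAG_MEANINGS.get(flag, "unknown flag"))
    print(record["upstream_host"], record["duration_ms"], "ms")
```

Read this way, both sample lines carry the `UF,URX` flags against an upstream pod on port 8080, and each request ran for roughly 30 seconds (30025 ms and 30055 ms) before the gateway returned the 503.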

We helped them recreate the control plane, following these steps:

The old control plane (reboot-ossm-prd) stopped working, causing "istio routing" issues.
To test whether this was a pod problem, we deleted the istiod and istio-ingressgateway pods, but that did not help.
With the istioctl tool we checked all the Envoy information (listeners, routes, clusters, endpoints) and everything seemed to be okay.
We then created a "temporary" control plane with the same SMCP and onboarded the members; once we did that, it worked.

The issue is now solved, but the customer is asking for an RCA to understand the cause and what they should do to avoid a production-down impact in the future. They are open to disconnecting ArgoCD from the Service Mesh configuration or blocking automatic updates of the Service Mesh Operator.

We collected a must-gather during the issue for this review. Examples of these errors can be found in these namespaces:
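
If the customer chooses to block automatic operator updates, one common approach is to switch the operator's OLM Subscription to manual InstallPlan approval. The Python sketch below only builds the corresponding `oc patch` command and does not run it. The Subscription name `servicemeshoperator` and the namespace `openshift-operators` are assumptions and should be checked against the cluster first.

```python
import json
import shlex


def manual_approval_patch_command(subscription, namespace):
    """Build an `oc patch` argv that sets installPlanApproval to Manual."""
    patch = {"spec": {"installPlanApproval": "Manual"}}
    return [
        "oc", "patch", "subscription.operators.coreos.com", subscription,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Hypothetical names: verify the real Subscription name and namespace
    # on the cluster before using the printed command.
    argv = manual_approval_patch_command("servicemeshoperator", "openshift-operators")
    print(shlex.join(argv))
```

With manual approval, new operator versions stay pending until someone approves the InstallPlan, so the upgrade can be scheduled for a maintenance window instead of happening on its own.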

reboot-ossm-prd
reboot-prd

Please, let us know if you need any additional information or if we can help further with the investigation.

Thank you for your cooperation and best regards.

must-gather.aa  200.00 MB  2024/04/25 2:07 PM
must-gather.ab  180.54 MB  2024/04/25 2:07 PM

Assignee: Brian Mangoenpawiro
Reporter: Hugo Morella Soler
Votes: 0
Watchers: 3
Created: 2024/04/25 2:13 PM
Updated: 2024/05/02 12:00 PM
