Type: Ticket
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: OSSM 2.3.1
Component/s: Maistra
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

What problem/issue/behavior are you having trouble with? What do you expect to see?
We run the OSSM operator 2.3.1 on OpenShift 4.12.

We use the same setup as in support ticket 03436396, but with a DestinationRule that defines outlierDetection on the green mesh for the imported red service.
We performed the following steps:

1) Scale the red webserver deployment to 2 replicas.
2) Start sending traffic from a single green client pod to the imported red service.
3) Observe that the traffic is load-balanced 50/50 across the 2 red pods.
4) Change the HTTP response code from 200 to 503 on only one of the red webserver pods (pod #1).

After step 4, 100% of the queries start failing with "no healthy upstream". We believe this is because only one endpoint is associated with the service, namely the red federation egress (we have a single egress pod in our setup), and this endpoint is marked unhealthy as soon as a query hits red webserver pod #1:

curl -s http://localhost:15000/clusters | grep svc-istio-test | grep health
outbound|80||svc-istio-test.red-workload.svc.red-imports.local::10.225.16.77:15443::health_flags::/failed_outlier_check
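
To check every cluster at once rather than grepping for one service, the admin output can be summarized with a short script. This is a minimal sketch, assuming the plain-text `/clusters` format shown above (`<cluster>::<host:port>::health_flags::<flags>`) and IPv4 host addresses. The function name `ejected_hosts` is hypothetical, and the sample input is the line from this ticket.

```python
from collections import defaultdict


def ejected_hosts(clusters_text):
    """Map each cluster name to its hosts flagged failed_outlier_check.

    Assumes the text format of the /clusters admin output as shown in
    this ticket: <cluster>::<host:port>::health_flags::<flags>.
    IPv6 hosts (which contain "::") are not handled by this sketch.
    """
    ejected = defaultdict(list)
    for line in clusters_text.splitlines():
        parts = line.strip().split("::")
        # Only per-host health_flags lines have exactly this shape.
        if len(parts) != 4 or parts[2] != "health_flags":
            continue
        cluster, host, _, flags = parts
        # Flags are printed as a "/"-separated list, e.g. "/failed_outlier_check".
        if "failed_outlier_check" in flags.split("/"):
            ejected[cluster].append(host)
    return dict(ejected)


sample = (
    "outbound|80||svc-istio-test.red-workload.svc.red-imports.local"
    "::10.225.16.77:15443::health_flags::/failed_outlier_check"
)
print(ejected_hosts(sample))
```

In practice the input would be the output of the curl command above, piped to the script or saved to a file.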

If this understanding is correct, how can proper resiliency be put in place for the federated cluster? Even with 100 red webserver replicas, a single failing pod would be enough to circuit-break the whole remote cluster, even though the other 99 red pods would still be perfectly healthy. It looks like this could be mitigated by scaling up the red federation egress replicas (to increase the number of endpoints leading to the remote cluster), but for large deployments it would be impractical to match the footprint of the remote deployment.
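
The scaling concern can be made concrete with a rough model. This is a sketch under stated assumptions, not a description of the proxy's actual algorithm: requests through the single egress endpoint are spread uniformly at random over the red pods, exactly one red pod always returns 503, and the endpoint is ejected after `consecutive_errors` 5xx responses in a row. The function name and the `consecutive_errors` parameter are hypothetical; the DestinationRule values used in our setup are not reproduced here. Re-admission after the ejection period is not modeled.

```python
def ejection_probability(replicas, requests, consecutive_errors=1):
    """Probability of `consecutive_errors` failures in a row occurring
    within `requests` requests when 1 of `replicas` pods always fails.

    Assumption: each request independently hits the failing pod with
    probability 1 / replicas (uniform random load balancing).
    """
    p_fail = 1.0 / replicas
    k = consecutive_errors
    # state[i] = probability of being on a run of i failures so far,
    # without ever having reached k failures in a row.
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(requests):
        nxt = [0.0] * k
        for run, prob in enumerate(state):
            nxt[0] += prob * (1.0 - p_fail)    # a success resets the run
            if run + 1 < k:
                nxt[run + 1] += prob * p_fail  # a failure extends the run
            # otherwise the run reaches k and the endpoint is ejected
        state = nxt
    return 1.0 - sum(state)


for n_replicas in (2, 10, 100):
    print(n_replicas, ejection_probability(n_replicas, requests=100))
```

Under these assumptions and with a threshold of one error, the single endpoint is more likely than not to be ejected within the first 100 requests even with 100 replicas, so adding healthy red pods delays the outage but does not prevent it.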

What is the business impact? Please also provide timeframe information.
We are experimenting with OSSM and intend to use it to power new use cases in production soon.

Note this is also captured in ticket: https://access.redhat.com/support/cases/#/case/03436411
and doc https://docs.google.com/document/d/1673j_r6V1XXS-TNRsviCO9D_CT4nZQOmJRPz6h-TN8Q/edit

relates to

OSSM-5916 Federation: A single unhealthy pod on remote cluster triggers circuit breaker

Closed

Assignee: Eoin Fennessy

Reporter: Shaun Appleton

Votes: 1

Watchers: 7

Created: 2023/02/14 10:01 AM

Updated: 2024/11/04 7:37 PM

Resolved: 2024/02/09 5:05 PM
