[RFE-6835] Develop a fallback test against the cluster canary route to identify issues in the external load balancer

Type: Feature Request
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: Network Edge
Labels: None
Work Type: Improvement
Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected

1. Proposed title of this feature request

Develop a fallback test against the cluster canary route to identify issues in the external load balancer

2. What is the nature and description of the request?

The log messages emitted by the ingress operator do not indicate whether a failure is caused by the external load balancer or by a critical problem in the router-default pods deployed in the cluster. This can be seen in the probeRouteEndpoint function in the ingress operator code. The test only probes the OCP routes; if a route is unreachable because of an external issue (DNS, network outage, load balancer, etc.), the messages do not make that clear.

Additionally, OpenShift administrators very commonly interpret cluster operator problems caused by an external outage as an OCP problem. With a few curl tests, we can determine that the OCP platform is working normally but is being affected by an external problem. Some cluster operators are affected because they depend on the load balancer, the external network, and the DNS layer.

A possible improvement is a new implementation in which, when the HTTP connection to the canary route fails, new tests are initiated directly against the router pods to confirm that traffic works at least internally in the cluster. This would make clear that the issue is not in OCP but in an external component.

A second possibility is to create alerts in which the routes are tested against the router-default pods to determine whether the issue is internal or external.

Very common issues in the external network that affect the cluster operators:

NAT issues on the load balancer side
Unavailability on the load balancer side
Misconfigured load balancer algorithm or implementation type
Router node IPs not properly configured in the load balancer
Health checks not properly configured

More ideas can be implemented as well.

3. Why does the customer need this? (List the business requirements here)

Decrease the number of customer cases related to issues with the external load balancer
OCP administrators can quickly identify which component is failing
Improvement from the OCP product perspective: failures can be identified faster and the responsible team can be involved sooner
Faster analysis by Red Hat support, since the failed component is already identified. This also means shorter outages for customers in case of critical problems

4. List any affected packages or components.

Authentication/console and ingress cluster operators
Ingress operator
IngressController

Assignee: Marc Curry

Reporter: Bruno Gomes

Votes: 8

Watchers: 8

Created: 2024/12/06 5:09 PM

Updated: 2025/02/04 3:28 PM
