Feature Request
Resolution: Unresolved
1. Proposed title of this feature request
Develop a fallback test against the cluster canary route to identify issues in the external load balancer
2. What is the nature and description of the request?
The log messages currently emitted by the ingress operator do not indicate whether a canary failure is caused by the external load balancer or by a critical problem with the router-default pods deployed in the cluster. This can be seen in the probeRouteEndpoint function in the ingress operator code: the check only probes the OCP route, so if the route is unavailable because of an external issue (DNS, network outage, load balancer, etc.), the messages do not make that clear.
Additionally, OpenShift administrators very commonly interpret cluster operator problems caused by an external outage as an OCP problem. With a few curl tests we can determine that the OCP platform is working normally but is being affected by an external problem, because some cluster operators depend on the load balancer, the external network, and the DNS layer.
A possible improvement is a new implementation in which, when the HTTP connection to the canary route fails, additional tests are run directly against the router pods to confirm that traffic is working at least internally in the cluster. This would make it clear that the issue is not in OCP but in an external component.
A second possibility is to create alerts based on testing the routes directly against the router-default pods, to determine whether the issue is internal or external.
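The fallback described above could look roughly like the following Go sketch. It is only an illustration of the idea, not the ingress operator's actual code: the names Diagnosis, Prober, HTTPProber and DiagnoseCanary are hypothetical, and the URL scheme, port and TLS settings needed to reach a router pod directly are assumptions left to the caller.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Diagnosis is a hypothetical result type telling where a canary
// failure most likely originates.
type Diagnosis string

const (
	Healthy             Diagnosis = "canary reachable through the external path"
	ExternalPathSuspect Diagnosis = "router pods answer directly; suspect DNS, network, or load balancer"
	RouterSuspect       Diagnosis = "router pods do not answer directly; suspect the router itself"
)

// Prober reports whether a GET to url succeeded. A non-empty host
// overrides the Host header so the router can select the canary route.
type Prober func(url, host string) bool

// HTTPProber builds a Prober on top of net/http with a per-request timeout.
func HTTPProber(client *http.Client, timeout time.Duration) Prober {
	return func(url, host string) bool {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return false
		}
		if host != "" {
			req.Host = host
		}
		resp, err := client.Do(req)
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

// DiagnoseCanary probes the canary route through the normal external
// path first and only falls back to the router pods when that fails.
// With an empty routerPodURLs slice a failure is reported as RouterSuspect.
func DiagnoseCanary(probe Prober, canaryURL, canaryHost string, routerPodURLs []string) Diagnosis {
	if probe(canaryURL, "") {
		return Healthy
	}
	for _, u := range routerPodURLs {
		if probe(u, canaryHost) {
			return ExternalPathSuspect
		}
	}
	return RouterSuspect
}
```

A caller would build the prober with something like HTTPProber(&http.Client{}, 5*time.Second) and pass the canary URL, the canary host name (for example canary-openshift-ingress-canary.apps.example.com) and the router pod addresses. The returned Diagnosis could then be included in the operator's log message or status condition, so the message itself states whether the external path or the router is the likely culprit.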
Very common issues in the external network that affect the cluster operators:
- NAT issues on the load balancer side
- Unavailability on the load balancer side
- Misconfigured load-balancing algorithm or load balancer implementation type
- Router node IPs not properly configured in the load balancer
- Health checks not configured correctly
More ideas can be implemented as well.
3. Why does the customer need this? (List the business requirements here)
- Reduce the number of customer cases related to issues with the external load balancer
- OCP administrators can quickly identify which component is failing
- Improve the OCP product so that failures are identified faster and the responsible team can be involved sooner
- Faster analysis by Red Hat support, because the failed component is already identified. This also means shorter outages for customers in case of critical problems
4. List any affected packages or components.
- Authentication/console and ingress cluster operators
- Ingress operator
- IngressController