[OCPBUGS-6013] OSD clusters' Ingress health checks & routes fail after swapping application router between public and private - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Networking / ovn-kubernetes
Labels:

Test Coverage:

Severity:
Critical
Regression:
None
Sprint:
SDN Sprint 231, SDN Sprint 232, SDN Sprint 233, SDN Sprint 234, SDN Sprint 236
sprint_count:
5
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Target Version:

4.14.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

When using the OSD "Edit Cluster Ingress" feature to change the default application router from public to private or vice versa, the external AWS load balancer is removed and replaced by the cloud-ingress-operator.

When this happens, the external load balancer health checks never receive a successful check from the backend nodes, and all nodes are marked out-of-service.

Cluster operators that depend on *.apps.CLUSTERNAME.devshift.org begin to fail, initially with DNS errors (which is expected), but then with EOF errors when requesting the routes associated with their health checks, e.g.:

OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.chcollin-mjtj.cvgo.s1.devshift.org/healthz": EOF

This always degrades the authentication, console and ingress (via ingress-canary) operators.

Logs from the `ovnkube-node-*` pods for the instance show OVN properly updating the port for the endpoint health check to the new port in use by the AWS LB.

The endpointSlices for the endpoint are updated/replaced, but with no change in config as far as I can tell.  They're just recreated.

The service backing the router-default pods has the proper HealthCheckNodePort configuration, matching the new AWS LB.

Curling the service via CLUSTER_IP:NODE_PORT_HEALTH_CHECK/healthz results in a connection timeout.

Curling the local HAProxy health check from within the router-default pod via `localhost:1936/healthz` returns an OK response, as expected.
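The two probes above can be compared side by side. The following is a minimal sketch, assuming the Service object was fetched separately as JSON (for example with `oc get service router-default -n openshift-ingress -o json`; the service name is an assumption) and that it carries the standard Kubernetes `spec.clusterIP` and `spec.healthCheckNodePort` fields. The function names and the sample values are hypothetical.

```python
import urllib.error
import urllib.request


def health_check_url(service, host=None):
    """Build the /healthz URL for a Service's health-check node port.

    `host` defaults to the Service's cluster IP, as in the report;
    pass a node IP to probe a specific node instead.
    """
    spec = service["spec"]
    port = spec.get("healthCheckNodePort")
    if not port:
        raise ValueError("service has no healthCheckNodePort set")
    target = host or spec["clusterIP"]
    # Assumption: the health check endpoint is served over plain HTTP.
    return f"http://{target}:{port}/healthz"


def probe(url, timeout=5.0):
    """Issue one HTTP GET and return a short outcome string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return f"HTTP {exc.code}"
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return f"failed: {exc}"


# Made-up sample values, for illustration only.
sample = {"spec": {"clusterIP": "172.30.0.10", "healthCheckNodePort": 31234}}
print(health_check_url(sample))
```

Running `probe(health_check_url(service))` next to `probe("http://localhost:1936/healthz")` from inside a router-default pod would show the same split described here: the local check answers while the node-port check does not.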

After restarting the router-default pods manually with `oc rollout restart deployment router-default -n openshift-ingress`, or simply deleting the pods, the cluster heals: the AWS LB sees the backend infra nodes in service again, and the cluster operators depending on the *.apps.CLUSTERNAME.devshift.org domain recover on their own as well.

I'm unsure whether this should go to network-ovn or network-multus (or some other component), so I'm starting here.  Please redirect me if necessary.

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Log in to the OCM console for the cluster (e.g. https://qaprodauth.console.redhat.com/openshift for staging)
2. From the network tab, select "Edit Cluster Ingress"
3. Check or uncheck the "Make Router Private" box for the default application router - it does not matter which way you're swapping.

Actual results:

Ingress to the default router begins to fail for the *.apps routes and never becomes available again.

Expected results:

Ingress would fail for ~15 minutes as things are reconfigured, and then become available again.

Additional info:

Two must-gathers from a test cluster I created are available via Google Drive (https://drive.google.com/drive/u/1/folders/1oIkNOSY0R9Mvo-BZ1Pa3W3iDDfF_726F), shared with Red Hat employees.  The first is from before the change, and the second is from after the change.  This is on a brand new cluster, so logs should be clean-ish.

blocks

OCPBUGS-13598 OSD clusters' Ingress health checks & routes fail after swapping application router between public and private

Closed

is cloned by

OCPBUGS-13598 OSD clusters' Ingress health checks & routes fail after swapping application router between public and private

Closed

is related to

OCPBUGS-2554 ingress, authentication and console operator goes to degraded after switching default application router scope

Closed

links to

[Upstream Fix] Call SyncEndpoint from AddService

KCS

openshift/ovn-kubernetes#1671: OCPBUGS-6013: Call SyncEndpoints from AddService

RHEA-2023:5006 rpm
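
The linked fix title, "Call SyncEndpoints from AddService", hints at an ordering problem: the Service is deleted and recreated while its EndpointSlices carry no meaningful change. The toy model below only illustrates that kind of bug and is not the ovn-kubernetes code. The class, its methods, and the assumption that deleting a service discards the endpoint count the checker serves from are all hypothetical, and the model says nothing about why the real probe timed out.

```python
class ToyHealthChecker:
    """Toy per-node checker: healthy when a service has local endpoints."""

    def __init__(self, sync_on_add):
        self.sync_on_add = sync_on_add
        self.known_endpoints = {}  # local endpoint counts seen so far
        self.served = {}           # counts the checker answers from

    def endpoints_changed(self, name, local_count):
        # Endpoint events update the served count for existing services.
        self.known_endpoints[name] = local_count
        if name in self.served:
            self.served[name] = local_count

    def add_service(self, name):
        # A new service starts with zero endpoints unless we sync here.
        self.served[name] = 0
        if self.sync_on_add:
            self.served[name] = self.known_endpoints.get(name, 0)

    def delete_service(self, name):
        self.served.pop(name, None)

    def healthy(self, name):
        return self.served.get(name, 0) > 0


def swap_scope(checker):
    """Replay the public/private swap as a sequence of events."""
    checker.add_service("router-default")
    checker.endpoints_changed("router-default", 1)  # initial rollout
    checker.delete_service("router-default")        # swap removes the service
    checker.add_service("router-default")           # recreated, no endpoint event
    return checker.healthy("router-default")


print(swap_scope(ToyHealthChecker(sync_on_add=False)))  # False
print(swap_scope(ToyHealthChecker(sync_on_add=True)))   # True
```

In this model, a later endpoint event (such as the one produced by restarting the pods) would also restore health without the sync, which is consistent with the workaround described in the report.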


Assignee: Surya Seetharaman

Reporter: Chris Collins

QA Contact: Arti Sood

Votes: 0

Watchers: 28

Created: 2023/01/18 10:06 PM

Updated: 2024/04/29 5:11 PM

Resolved: 2023/10/31 12:56 PM
