OpenShift Bugs / OCPBUGS-13599

OSD clusters' Ingress health checks & routes fail after swapping application router between public and private


      This is a clone of issue OCPBUGS-13598. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-6013. The following is the description of the original issue:

      Description of problem:

      When using the OSD "Edit Cluster Ingress" feature to change the default application router from public to private or vice versa, the external AWS load balancer is removed and replaced by the cloud-ingress-operator.
      When this happens, the new external load balancer's health checks never receive a successful response from the backend nodes, and all nodes are marked out-of-service.
      Cluster operators that depend on *.apps.CLUSTERNAME.devshift.org begin to fail, initially with DNS errors (which is expected), but then with EOF messages when attempting to reach the routes associated with their health checks, e.g.:
      OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.chcollin-mjtj.cvgo.s1.devshift.org/healthz": EOF
      This always degrades the authentication, console and ingress (via ingress-canary) operators.
      Logs from the `ovnkube-node-*` pods for the instance show OVN properly updating the port for the endpoint health check to the new port in use by the AWS LB.
      The EndpointSlices for the endpoint are updated/replaced, but with no change in configuration as far as I can tell; they're just recreated.
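      For reference, the slices backing the router service can be listed via the standard `kubernetes.io/service-name` label and diffed across the swap; a minimal sketch, with the selector built as a helper (namespace and service names as in this ticket):

      ```shell
      # Build the label selector that EndpointSlices carry for their owning service.
      slice_selector() {
        printf 'kubernetes.io/service-name=%s' "$1"
      }

      # Compare the slices before and after the ingress swap, e.g.:
      #   oc -n openshift-ingress get endpointslices \
      #      -l "$(slice_selector router-default)" -o yaml
      slice_selector router-default
      ```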
      The service backending the router-default pods has the proper HealthCheckNodePort configuration, matching the new AWS LB.
      Curling the service via CLUSTER_IP:NODE_PORT_HEALTH_CHECK/healthz results in a connection timeout.
      Curling the local health check for HAPROXY within the router-default pod via `localhost:1936/healthz` results in an OK response as expected.
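      The two probes above can be sketched as follows; the node IP and port are hypothetical placeholders, and the real health-check port comes from the service's `spec.healthCheckNodePort`:

      ```shell
      # The AWS LB probes http://<node>:<healthCheckNodePort>/healthz;
      # build that URL so the same check can be run by hand.
      hc_url() {
        printf 'http://%s:%s/healthz' "$1" "$2"
      }

      # Look up the allocated port, then probe it the way the LB does:
      #   PORT=$(oc -n openshift-ingress get svc router-default \
      #            -o jsonpath='{.spec.healthCheckNodePort}')
      #   curl -m 5 "$(hc_url <NODE_IP> "$PORT")"   # times out in this bug
      # Inside a router-default pod, the local HAProxy check still succeeds:
      #   curl -s localhost:1936/healthz
      hc_url 10.0.128.10 31999
      ```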
      After rolling the router-default pods manually with `oc rollout restart deployment router-default -n openshift-ingress`, or simply deleting the pods, the cluster heals: the AWS LB sees the backend infra nodes in service again, and the cluster operators depending on the *.apps.CLUSTERNAME.devshift.org domain recover on their own as well.
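      A sketch of that workaround (namespace and deployment names as in this ticket):

      ```shell
      # Recreate the router pods so the health-check plumbing is rebuilt,
      # then watch the rollout and the previously degraded operators.
      restart_cmd() {
        printf 'oc -n %s rollout restart deployment/%s' "$1" "$2"
      }

      # In practice:
      #   oc -n openshift-ingress rollout restart deployment/router-default
      #   oc -n openshift-ingress rollout status deployment/router-default
      #   oc get co authentication console ingress
      restart_cmd openshift-ingress router-default
      ```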
      I'm unsure whether this should go to network-ovn or network-multus (or some other component), so I'm starting here. Please redirect me if necessary.


      Version-Release number of selected component (if applicable):


      How reproducible:


      Steps to Reproduce:

      1. Log in to the OCM console for the cluster (e.g. https://qaprodauth.console.redhat.com/openshift for staging)
      2. From the network tab, select "Edit Cluster Ingress"
      3. Check or uncheck the "Make Router Private" box for the default application router; it does not matter which direction you're swapping.

      Actual results:

      Ingress to the default router begins to fail for the *.apps routes and never becomes available again.

      Expected results:

      Ingress would fail for ~15 minutes while things are reconfigured, and then become available again.

      Additional info:

      Two must-gathers from a test cluster I created are available via Google Drive (https://drive.google.com/drive/u/1/folders/1oIkNOSY0R9Mvo-BZ1Pa3W3iDDfF_726F) and shared with Red Hat employees. The first is from before the change and the second from after. This is a brand-new cluster, so the logs should be clean-ish.

            sseethar Surya Seetharaman
            openshift-crt-jira-prow OpenShift Prow Bot
            Arti Sood Arti Sood