OpenShift Bugs / OCPBUGS-2554

ingress, authentication and console operators go degraded after switching the default application router scope


Details

    • Sprint: SDN Sprint 227, SDN Sprint 228

    Description

      Description of problem:
      Switching the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller results in a degraded ingress operator, and routes served by that ingress controller, such as the console URL, become inaccessible.
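      For reference, the current scope can be read back from the default ingresscontroller (a minimal check; the output shown is only an example):

      $ oc -n openshift-ingress-operator get ingresscontroller default \
          -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.scope}'
      External
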
      Degraded operators after scope change:

      $ oc get co | grep -v ' True        False         False'
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.11.4    False       False         True       72m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.kartrosa.ukld.s1.devshift.org/healthz": EOF
      console                                    4.11.4    False       False         False      72m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org): Get "https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org": EOF
      ingress                                    4.11.4    True        False         True       65m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      
      

      We have noticed that each time this happens the underlying AWS load balancer gets recreated, which is expected; however, the router pods probably do not get notified about the new load balancer. The instances behind the new load balancer become 'OutOfService'.
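      The registration state can also be confirmed from the AWS side (a sketch assuming the default router service is backed by a Classic ELB; <elb-name> stands for the newly created load balancer):

      $ aws elb describe-instance-health --load-balancer-name <elb-name> \
          --query 'InstanceStates[].[InstanceId,State]' --output table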

      Restarting one of the router pods fixes the issue: a couple of instances behind the load balancer return to 'InService', which leads to the operators becoming healthy again.
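      The workaround looked roughly like this (a sketch; the pod name is illustrative, and the label selector is the one set on the default router pods):

      $ oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
      $ oc -n openshift-ingress delete pod <one-of-the-router-default-pods>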

      Version-Release number of selected component (if applicable):

      ingress in 4.11.z; however, we suspect this issue also applies to older versions
      

      How reproducible:

      Consistently reproducible
      

      Steps to Reproduce:

      1. Create a test OCP 4.11 cluster in AWS
      2. Switch the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller in openshift-ingress-operator from External to Internal (or vice versa); an example patch command is shown after these steps
      3. A new load balancer is created in AWS for the default router service, however the instances behind it are not in service
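      Step 2 can be performed with a merge patch along these lines (a sketch; only the scope field is touched, and "Internal" is the example target value):

      $ oc -n openshift-ingress-operator patch ingresscontroller default --type=merge \
          -p '{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"Internal"}}}}'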
      
      

      Actual results:

      ingress, authentication and console operators go into a degraded state. The console URL of the cluster becomes inaccessible.
      

      Expected results:

      The ingresscontroller scope transition from Internal to External (or vice versa) is smooth, without any downtime or operators going into a degraded state. The console remains accessible.
      

       

            People

              mmahmoud@redhat.com Mohamed Mahmoud
              kramraja.openshift Karthik Perumal
              Hongan Li Hongan Li
              Votes: 0
              Watchers: 13
