OpenShift Bugs / OCPBUGS-17763

Authentication and console operators report "failed to GET route" (context deadline exceeded) after installation on Nutanix; the routes are reachable from pods on some masters but not from pods on other masters

Details

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.12.z, 4.11.z, 4.14
    • Installer / Nutanix
    • Important
    • No
    • False

    Description

      Description of problem:

      'OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz":
            context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
      
      after installation on Nutanix
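
The failing check is an HTTPS GET with a client-side timeout, so it can be approximated by hand from a pod or node on each master to compare results. The following is a minimal sketch, assuming Python 3 is available where it runs. `probe_healthz` is a hypothetical helper, not part of the authentication operator, and the 5-second default timeout is an assumption rather than the operator's actual value. If the ingress certificate is not in the local trust store, pass an `ssl.SSLContext` that trusts the ingress CA as `context`.

```python
import socket
import urllib.error
import urllib.request


def probe_healthz(url, timeout=5.0, context=None):
    """Issue one GET against url and return (ok, detail).

    Hypothetical triage helper; the timeout default is an assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The router answered, but not with a 200.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, socket.timeout, OSError) as exc:
        # Timeouts, refused connections, DNS and TLS failures all land here.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `probe_healthz("https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz")` from each master in turn would show whether the timeout follows a specific node.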

      Version-Release number of selected component (if applicable):

      4.11.46

      How reproducible:

      50%. So far, 2 of 4 attempts.

      Steps to Reproduce:

      1. Install OCP 4.11.46 on Nutanix 
      Jenkins: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/225973/
      Template used for this installation: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_11/ipi-on-nutanix/versioned-installer-fips-ovn-csi_pvc-ci
      
      install-config.yaml
      
       apiVersion: v1
       controlPlane:
         architecture: amd64
         hyperthreading: Enabled
         name: master
         platform: {}
         replicas: 3
       compute:
       - architecture: amd64
         hyperthreading: Enabled
         name: worker
         platform: {}
         replicas: 2
       metadata:
         name: skordas-15b
       platform:
         nutanix:
           apiVIP: 10.0.132.12
           ingressVIP: 10.0.132.13
           subnetUUIDs:
           - efe26e93-f6cf-4d89-8104-009e85201fa8
           prismCentral:
             username: sgao
             password: HIDDEN
             endpoint:
               address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
               port: 9440
           prismElements:
           - uuid: 0005d9a4-8e4f-7c33-58d1-e9d0e2d48853
             endpoint:
               address: 10.0.128.159
               port: 9440
       pullSecret: HIDDEN
       networking:
         clusterNetwork:
         - cidr: 10.128.0.0/14
           hostPrefix: 23
         serviceNetwork:
         - 172.30.0.0/16
         machineNetwork:
         - cidr: 10.0.0.0/16
         networkType: OVNKubernetes
       publish: External
       credentialsMode: Manual
       fips: true
       baseDomain: qe.devcluster.openshift.com
       sshKey: SSH-KEY
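
One quick sanity check on this install-config is whether both VIPs lie inside `machineNetwork` and whether the three networks overlap. The sketch below uses only the standard-library `ipaddress` module with the values copied from the configuration above; `check_vips` and `overlapping` are hypothetical helper names. It only rules out an addressing mistake and says nothing about the Nutanix side.

```python
import ipaddress

# Values copied from the install-config.yaml above.
machine_network = ipaddress.ip_network("10.0.0.0/16")
cluster_network = ipaddress.ip_network("10.128.0.0/14")
service_network = ipaddress.ip_network("172.30.0.0/16")
vips = {"apiVIP": "10.0.132.12", "ingressVIP": "10.0.132.13"}


def check_vips(vips, machine_net):
    """Map each VIP name to True if its address lies inside machine_net."""
    return {name: ipaddress.ip_address(addr) in machine_net for name, addr in vips.items()}


def overlapping(networks):
    """Return every pair of networks that overlap."""
    nets = list(networks)
    return [(a, b) for i, a in enumerate(nets) for b in nets[i + 1:] if a.overlaps(b)]


vip_report = check_vips(vips, machine_network)
overlap_report = overlapping([machine_network, cluster_network, service_network])
print(vip_report)      # {'apiVIP': True, 'ingressVIP': True}
print(overlap_report)  # []
```

With these values both VIPs fall inside the machine network and no networks overlap, which suggests the addressing in the install-config is not the cause.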
       
      

      Actual results:

      $ oc get co authentication -o yaml
      
      
      status:
        conditions:
        - lastTransitionTime: "2023-08-15T17:55:33Z"
          message: |-
            APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
            OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
            OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
          reason: APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod::OAuthServerRouteEndpointAccessibleController_SyncError
          status: "True"
          type: Degraded
        - lastTransitionTime: "2023-08-15T17:53:14Z"
          message: 'AuthenticatorCertKeyProgressing: All is well'
          reason: AsExpected
          status: "False"
          type: Progressing
        - lastTransitionTime: "2023-08-15T17:53:33Z"
          message: 'OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz":
            context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
          reason: OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
          status: "False"
          type: Available
        - lastTransitionTime: "2023-08-15T17:28:03Z"
          message: All is well
          reason: AsExpected
          status: "True"
          type: Upgradeable
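
When several operators need checking, the same conditions can be filtered programmatically. This is a minimal sketch that assumes the output of `oc get co authentication -o json` has been loaded into a dictionary with the same structure as the YAML above. `HEALTHY` and `unhealthy_conditions` are hypothetical names, and the sample data is a condensed copy of the conditions shown above with the messages omitted.

```python
# Expected status per condition type when an operator is healthy
# (an assumption made for this sketch).
HEALTHY = {"Available": "True", "Degraded": "False", "Progressing": "False"}


def unhealthy_conditions(cluster_operator):
    """Return (type, reason) for each condition whose status differs from HEALTHY."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    return [
        (c["type"], c.get("reason", ""))
        for c in conditions
        if c["type"] in HEALTHY and c["status"] != HEALTHY[c["type"]]
    ]


# Condensed copy of the authentication operator conditions reported above.
authentication = {
    "status": {
        "conditions": [
            {"type": "Degraded", "status": "True",
             "reason": "APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod::OAuthServerRouteEndpointAccessibleController_SyncError"},
            {"type": "Progressing", "status": "False", "reason": "AsExpected"},
            {"type": "Available", "status": "False",
             "reason": "OAuthServerRouteEndpointAccessibleController_EndpointUnavailable"},
            {"type": "Upgradeable", "status": "True", "reason": "AsExpected"},
        ]
    }
}

for cond_type, reason in unhealthy_conditions(authentication):
    print(cond_type, reason)
```

For the data above this flags the Degraded and Available conditions and skips Progressing and Upgradeable.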
      

      Additional info:

      I hit this issue while trying to collect a must-gather:
      
      $ oc adm must-gather
      [must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e6341fd84317f92e74e494a78b8c3a12f576bfcfc4827f4cc7f49da358539eb3
      When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
      ClusterID: fd096799-a755-4fdb-8632-9e8087da3a1e
      ClusterVersion: Stable at "4.11.46"
      ClusterOperators:
      	clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
      OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
      OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.skordas-15b.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      	clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      	clusteroperator/machine-config is degraded because Failed to resync 4.11.46 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error required pool master is not ready, retrying. Status: (total: 3, ready 2, updated: 2, unavailable: 1, degraded: 0)]
      	clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
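
Every affected deployment in this summary reports one of three instances unavailable, and the master pool reports one unavailable machine, which would be consistent with a single master being affected. The sketch below tallies the per-deployment counts from the messages quoted above; the regular expression and the name `unavailable_instances` are assumptions made for illustration, not part of `oc` or must-gather.

```python
import re

# Matches the "N of M requested instances are unavailable for <name>" phrasing
# seen in the operator messages above.
PATTERN = re.compile(r"(\d+) of (\d+) requested instances are unavailable for (\S+)")

# Message lines copied from the must-gather summary above.
summary = """\
APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
"""


def unavailable_instances(text):
    """Map each deployment named in text to (unavailable, requested)."""
    return {name: (int(down), int(total)) for down, total, name in PATTERN.findall(text)}


report = unavailable_instances(summary)
for name, (down, total) in report.items():
    print(f"{name}: {down}/{total} unavailable")
```

Checking which node hosts the unavailable pod of each deployment would confirm or rule out the single-master pattern.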
      

          People

            yanhli@redhat.com Yanhua Li
            skordas Simon Kordas
            Shang Gao Shang Gao