OpenShift Bugs / OCPBUGS-2432

authentication and console cluster operators are not available after migrating loaded worker nodes to larger instance types on AWS


Details

    • Sprint 228
    • Rejected

    Description

      Description of problem:

      After migrating a cluster's worker nodes to larger instance types, multiple cluster operators are degraded, and the authentication and console cluster operators are not available.
      
      oc get co | egrep -v 'True.*False.*False'
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-10-05-053337   False       False         True       3h46m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.sv-aws-412.qe.devcluster.openshift.com/healthz": EOF
      console                                    4.12.0-0.nightly-2022-10-05-053337   False       False         False      3h46m   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.sv-aws-412.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.sv-aws-412.qe.devcluster.openshift.com": EOF
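The filter above works because a healthy cluster operator row reads True/False/False in the AVAILABLE/PROGRESSING/DEGRADED columns, so `egrep -v 'True.*False.*False'` hides healthy operators and leaves only the unhealthy ones (plus the header). A minimal self-contained sketch, using hypothetical sample output rather than data from this cluster:

```shell
# Hypothetical, trimmed 'oc get co' output (not taken from the cluster above).
sample='NAME            AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication  False       False         True       3h46m
dns             True        False         False      3h46m'

# Hide healthy rows (AVAILABLE=True, PROGRESSING=False, DEGRADED=False);
# the header survives because it never matches the pattern.
printf '%s\n' "$sample" | egrep -v 'True.*False.*False'
```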
      

      Version-Release number of selected component (if applicable):

      # oc version
      Client Version: 4.12.0-0.nightly-2022-10-05-053337
      Kustomize Version: v4.5.4
      Server Version: 4.12.0-0.nightly-2022-10-05-053337
      Kubernetes Version: v1.25.0+3ef6ef3

      How reproducible:

      Replace loaded worker nodes with larger instance types and perform cluster health check.

      Steps to Reproduce:

      1. Create a cluster with 3 master nodes ('m5.xlarge') and 30 worker nodes ('m5.xlarge') with OVN.
      2. Run the kube-burner cluster-density workload (https://github.com/cloud-bulldozer/e2e-benchmarking).
      3. Note the CPU and memory usage of the master nodes.
      4. Create a new machineset of 15 worker nodes using the 'm5.2xlarge' instance type.
      5. Scale down the existing machineset ('m5.xlarge') to 0, one node at a time.
      6. Verify that all cluster-density pods migrate successfully to the new machineset.
      7. Delete all cluster-density namespaces.
      8. Rerun the kube-burner cluster-density test on the cluster with only the new machineset.
      9. The test fails while trying to get the Route for Prometheus, and a cluster health check reports the authentication, console, and ingress cluster operators as degraded.
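Steps 4-5 above can be sketched with `oc scale` against the machinesets in the `openshift-machine-api` namespace. The machineset names below are hypothetical placeholders, not the names from this cluster, and `OC='echo oc'` keeps the sketch as a dry run that only prints the commands it would execute:

```shell
# Dry-run by default: OC prints the oc commands instead of executing them.
# On a real cluster, run with OC=oc.
OC="${OC:-echo oc}"

# Step 4: scale the new 'm5.2xlarge' machineset (hypothetical name) to 15 workers.
$OC scale machineset worker-m5-2xlarge --replicas=15 -n openshift-machine-api

# Step 5: drain the old 'm5.xlarge' machineset down to 0, one replica at a time.
for n in $(seq 29 -1 0); do
  $OC scale machineset worker-m5-xlarge --replicas="$n" -n openshift-machine-api
done
```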
      
      

      Actual results:

      oc get co | egrep -v 'True.*False.*False'
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-10-05-053337   False       False         True       3h46m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.sv-aws-412.qe.devcluster.openshift.com/healthz": EOF
      console                                    4.12.0-0.nightly-2022-10-05-053337   False       False         False      3h46m   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.sv-aws-412.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.sv-aws-412.qe.devcluster.openshift.com": EOF
      
      

      Expected results:

      No cluster operators should be degraded.
      The kube-burner cluster-density workload should run successfully on the new machineset.

      Additional info:

       

          People

            gspence@redhat.com Grant Spence
            svetsa@redhat.com Sharada Vetsa
            Ke Wang
            Votes: 0
            Watchers: 3
