OpenShift Bugs / OCPBUGS-2636

Deploying an IPI cluster with OVNKubernetes on AWS with m6g.metal arm64 machines fails

      Description of problem:

      IPI clusters with OVNKubernetes on AWS with m6g[d]?.metal machines fail to install.
      In particular:
      - the console operator is not available because of "missing replicas" (its pods are in CrashLoopBackOff because they cannot reach the oauth routes);
      - the canary-routes pods are up, but they are unreachable via their route;
      - the router-default pods appear to be up;
      - the Classic LB on AWS is bound only to the master instances, and those instances are reported as unhealthy. The HTTP health check fails in the AWS console; it does not fail if the check is changed from HTTP to the TCP SYN/ACK check;
      - however, curl requests to the health check URI from any cluster node to any other node return:
      
      curl 10.0.215.132:32407/healthz
      { "service": { "namespace": "openshift-ingress", "name": "router-default" }, "localEndpoints": 0|1 (based on the node) }
      
      
      - in a test, deleting the ingress controller and letting it be re-created allowed the reconciliation to conclude successfully and the installation to finish:
      oc -n openshift-ingress-operator delete ingresscontroller/default 
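
The health check probe described above can be repeated across nodes with a small sketch like the following. It is not part of the original report: the helper names `local_endpoints` and `probe_nodes` are hypothetical, and it assumes the probing host can reach the node network and that `curl` and `sed` are available. Port 32407 and the IP 10.0.215.132 are the values from this report.

```shell
# Sketch only: helper names are hypothetical, not part of oc or the installer.

# Print the localEndpoints count from a health check JSON body read on stdin.
local_endpoints() {
  sed -n 's/.*"localEndpoints"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Probe the health check NodePort on every node IP given as an argument.
# $1 is the port (32407 in this report); the remaining arguments are node IPs.
probe_nodes() {
  port=$1
  shift
  for ip in "$@"; do
    # -m 5 caps each request at 5 seconds so an unreachable node does not stall the loop
    body=$(curl -s -m 5 "http://${ip}:${port}/healthz")
    printf '%s localEndpoints=%s\n' "$ip" "$(printf '%s' "$body" | local_endpoints)"
  done
}

# Example, run from a host that can reach the node network (assumption):
#   port=$(oc -n openshift-ingress get svc router-default -o jsonpath='{.spec.healthCheckNodePort}')
#   probe_nodes "$port" 10.0.215.132
```

A node that hosts a router-default pod should report `localEndpoints=1` and the others `localEndpoints=0`; an empty value would suggest the node did not answer over HTTP at all.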

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-arm64-2022-10-18-153953

      How reproducible:

      always

      Steps to Reproduce:

      1. Install an IPI cluster with OVNKubernetes on AWS using m6g[d].metal nodes
      

      Actual results:

      The installation fails: 
      oc get co
      NAME                                       VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-arm64-2022-10-18-153953   False       False         True       9h      OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.adistefa-1020f.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      baremetal                                  4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      cloud-controller-manager                   4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      cloud-credential                           4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      cluster-autoscaler                         4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      config-operator                            4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      console                                    4.12.0-0.nightly-arm64-2022-10-18-153953   False       True          False      9h      DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      csi-snapshot-controller                    4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      dns                                        4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      etcd                                       4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      image-registry                             4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      8h      
      ingress                                    4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         True       8h      The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      insights                                   4.12.0-0.nightly-arm64-2022-10-18-153953   False       False         True       56m     Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed...
      kube-apiserver                             4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      kube-controller-manager                    4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      kube-scheduler                             4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      kube-storage-version-migrator              4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      machine-api                                4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      8h      
      machine-approver                           4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      machine-config                             4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      marketplace                                4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      monitoring                                 4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      8h      
      network                                    4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      node-tuning                                4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      openshift-apiserver                        4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      openshift-controller-manager               4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      openshift-samples                          4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      operator-lifecycle-manager                 4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      operator-lifecycle-manager-catalog         4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      operator-lifecycle-manager-packageserver   4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      service-ca                                 4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h      
      storage                                    4.12.0-0.nightly-arm64-2022-10-18-153953   True        False         False      9h    
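
To pick the failing operators out of a long listing like the one above, a filter along these lines may help. It is a sketch, not part of the original report: the function name `unhealthy_operators` is hypothetical, and it assumes the column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED).

```shell
# Sketch only: the function name is hypothetical.
# Reads `oc get co --no-headers` output on stdin and prints the operators
# that are either not Available (column 3) or Degraded (column 5).
unhealthy_operators() {
  awk '$3 != "True" || $5 == "True" { print $1 }'
}

# Example (assumes a logged-in oc session):
#   oc get co --no-headers | unhealthy_operators
```

Applied to the output above, this would list authentication, console, ingress and insights.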
      
      

      Expected results:

      The installation succeeds.

      Additional info:

      - It works when the instance type is not a metal type
      - It works when using OpenShiftSDN
      - It works when using m5.metal machines and x86 nightlies (4.12.0-0.nightly-2022-10-18-192348)
      - must-gather: https://drive.google.com/file/d/1S-q6PPomreWAzfFZY8r5brWuq6Ea-wJn/view?usp=sharing 
      
      
      install-config.yaml:
      
       ---
       apiVersion: v1
       controlPlane:
         architecture: arm64
         hyperthreading: Enabled
         name: master
         platform:
           aws:
             type: m6gd.metal
         replicas: 3
       compute:
       - architecture: arm64
         hyperthreading: Enabled
         name: worker
         platform:
           aws:
             type: m6gd.metal
         replicas: 3
       metadata:
         name: adistefa-1020f
       platform:
         aws:
           region: us-east-2
       pullSecret: HIDDEN
       networking:
         clusterNetwork:
         - cidr: 10.128.0.0/14
           hostPrefix: 23
         serviceNetwork:
         - 172.30.0.0/16
         machineNetwork:
         - cidr: 10.0.0.0/16
         networkType: OVNKubernetes
       publish: External
       baseDomain: qe.devcluster.openshift.com
       sshKey: -
       
       
      
      
      

              jeffdyoung Jeff Young
              rhn-support-adistefa Alessandro Di Stefano
              Alessandro Di Stefano Alessandro Di Stefano
              Hongan Li, Lin Wang, Sharada Vetsa, Yunfei Jiang