OpenShift Bugs / OCPBUGS-42214

Failed to provision private HC on AWS


      * Previously, the {ai-full} did not reload new data from the Assisted Service when the {ai-full} checked control plane nodes for readiness and a conflict existed with a write operation from the {ai-full} controller. This conflict prevented the {ai-full} from detecting a node that was marked by the {ai-full} controller as `Ready` because the {ai-full} relied on older information. With this release, the {ai-full} can receive the newest information from the Assisted Service, so that the {ai-full} can accurately detect the status of each node. (link:https://issues.redhat.com/browse/OCPBUGS-38003[*OCPBUGS-38003*])
    • Bug Fix
    • Done

      Description of problem:

      Provisioning a private HC on AWS fails.

      How reproducible:

      Always. 

      Steps to Reproduce:

      Create a private HC on AWS following the steps in https://hypershift-docs.netlify.app/how-to/aws/deploy-aws-private-clusters/:
      
      RELEASE_IMAGE=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-20-005211
      HO_IMAGE=quay.io/hypershift/hypershift-operator:latest
      BUCKET_NAME=fxie-hcp-bucket
      REGION=us-east-2
      AWS_CREDS="$HOME/.aws/credentials"
      CLUSTER_NAME=fxie-hcp-1
      BASE_DOMAIN=qe.devcluster.openshift.com
      EXT_DNS_DOMAIN=hypershift-ext.qe.devcluster.openshift.com
      PULL_SECRET="/Users/fxie/Projects/hypershift/.dockerconfigjson"
      
      hypershift install --oidc-storage-provider-s3-bucket-name $BUCKET_NAME --oidc-storage-provider-s3-credentials $AWS_CREDS --oidc-storage-provider-s3-region $REGION --private-platform AWS --aws-private-creds $AWS_CREDS --aws-private-region=$REGION --wait-until-available --hypershift-image $HO_IMAGE
      
      hypershift create cluster aws --pull-secret=$PULL_SECRET --aws-creds=$AWS_CREDS --name=$CLUSTER_NAME --base-domain=$BASE_DOMAIN --node-pool-replicas=2 --region=$REGION --endpoint-access=Private --release-image=$RELEASE_IMAGE --generate-ssh

      Additional info:

      From the MC:
      $ for k in $(oc get secret -n clusters-fxie-hcp-1 | grep -i kubeconfig | awk '{print $1}'); do echo $k; oc extract secret/$k -n clusters-fxie-hcp-1 --to - 2>/dev/null | grep -i 'server:'; done
      admin-kubeconfig
          server: https://a621f63c3c65f4e459f2044b9521b5e9-082a734ef867f25a.elb.us-east-2.amazonaws.com:6443
      aws-pod-identity-webhook-kubeconfig
          server: https://kube-apiserver:6443
      bootstrap-kubeconfig
          server: https://api.fxie-hcp-1.hypershift.local:443
      cloud-credential-operator-kubeconfig
          server: https://kube-apiserver:6443
      dns-operator-kubeconfig
          server: https://kube-apiserver:6443
      fxie-hcp-1-2bsct-kubeconfig
          server: https://kube-apiserver:6443
      ingress-operator-kubeconfig
          server: https://kube-apiserver:6443
      kube-controller-manager-kubeconfig
          server: https://kube-apiserver:6443
      kube-scheduler-kubeconfig
          server: https://kube-apiserver:6443
      localhost-kubeconfig
          server: https://localhost:6443
      service-network-admin-kubeconfig
          server: https://kube-apiserver:6443
      

       

      The bootstrap-kubeconfig uses an incorrect KAS port, 443 (it should be 6443, since the KAS is exposed through a load balancer). The kubelet on each HC node therefore uses the same incorrect port. As a result, the AWS VMs are provisioned but cannot join the HC as nodes.
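The port check described above can be scripted. The following is a minimal sketch, not part of the original report: the helper names `server_ports` and `check_kas_port` are hypothetical, and the expected port 6443 is taken from the observation above. It reads kubeconfig text on stdin and flags every `server:` entry whose port differs from the expected one.

```shell
# Hypothetical helper: print "<host> <port>" for each
# "server: https://host:port" line read from stdin.
server_ports() {
  sed -n 's|^[[:space:]]*server:[[:space:]]*https://\([^:/]*\):\([0-9]*\).*$|\1 \2|p'
}

# Hypothetical helper: report every server whose port differs from the
# expected KAS port given as $1. Returns non-zero if any mismatch is found.
check_kas_port() {
  expected="$1"
  server_ports | awk -v want="$expected" '
    $2 != want { print "MISMATCH " $1 " port " $2 " (expected " want ")"; bad = 1 }
    END { exit bad + 0 }'
}
```

On the MC it could be used with the same `oc extract` invocation shown earlier, for example `oc extract secret/bootstrap-kubeconfig -n clusters-fxie-hcp-1 --to - 2>/dev/null | check_kas_port 6443`.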

      From a bastion:
      [ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 443
      Ncat: Version 7.50 ( https://nmap.org/ncat )
      Ncat: Connection timed out.
      [ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 6443
      Ncat: Version 7.50 ( https://nmap.org/ncat )
      Ncat: Connected to 10.0.143.91:6443.
      Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
      

       

      In addition, the CNO passes the same wrong KAS port to the network components on the HC.

       

      The HAProxy configuration on the VMs has the same problem:

      frontend local_apiserver
        bind 172.20.0.1:6443
        log global
        mode tcp
        option tcplog
        default_backend remote_apiserver
      
      backend remote_apiserver
        mode tcp
        log global
        option httpchk GET /version
        option log-health-checks
        default-server inter 10s fall 3 rise 3
        server controlplane api.fxie-hcp-1.hypershift.local:443 
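To confirm which port the HAProxy backend targets on a node, the `server` lines can be pulled out of the configuration. This is a minimal sketch: the function name `haproxy_backend_ports` is hypothetical, and the location of the configuration file on the VM is not stated in this report, so the sketch reads from stdin or from a path the caller supplies.

```shell
# Hypothetical helper: print "<server-name> <port>" for each "server" line
# in an HAProxy configuration read from stdin or from the files given as
# arguments. "default-server" lines are ignored because their first field
# differs from "server".
haproxy_backend_ports() {
  awk '$1 == "server" { n = split($3, a, ":"); print $2, a[n] }' "$@"
}
```

Fed the configuration above, this prints the `controlplane` backend with port 443 rather than 6443.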

              sjenning Seth Jennings
              fxierh Feilian Xie
              Feilian Xie Feilian Xie