OCPBUGS-1911: [CORS-2260] "Bootstrap failed to complete" and compute machines failed on first-boot


    Description

      Description of problem:

      "Bootstrap failed to complete" and compute machines failed on first-boot

      Version-Release number of selected component (if applicable):

      $ ./openshift-install version
      ./openshift-install 4.12.0-0.nightly-2022-09-28-204419
      built from commit 9eb0224926982cdd6cae53b872326292133e532d
      release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
      release architecture amd64
      

      How reproducible:

      Always so far; I have tried twice and hit the same issue both times.

      Steps to Reproduce:

      1. create a VPC network, subnets, and a firewall rule allowing SSH access to the bastion host
      2. create the bastion host, setting a valid service account with the "https://www.googleapis.com/auth/cloud-platform" scope (see the gcloud sketch after this list)
      3. scp the pull secret to the bastion host
      4. ssh to the bastion host (all subsequent steps run on the bastion host unless stated otherwise)
      5. get "oc", e.g. curl https://mirror2.openshift.com/pub/openshift-v4/clients/ocp/4.9.9/openshift-client-linux-4.9.9.tar.gz -o openshift-client-linux-4.9.9.tar.gz; tar zxvf openshift-client-linux-4.9.9.tar.gz
      6. obtain the installation program
      7. prepare a valid "install-config.yaml" (as a workaround for OCPBUGS-1896)
      8. then, please see the attached "create-cluster" for the installation steps/errors
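
      Below is a minimal sketch of steps 1 and 2, assuming gcloud is already authenticated against the host project; every resource name, the region/zone, the CIDR ranges, and the service-account address are placeholders rather than the values used in this reproduction:

      # step 1: VPC network, subnets, and a firewall rule allowing SSH to the bastion
      gcloud compute networks create test-shared-vpc --subnet-mode=custom
      gcloud compute networks subnets create test-master-subnet \
          --network=test-shared-vpc --region=us-central1 --range=10.0.0.0/19
      gcloud compute networks subnets create test-worker-subnet \
          --network=test-shared-vpc --region=us-central1 --range=10.0.32.0/19
      gcloud compute firewall-rules create test-allow-ssh \
          --network=test-shared-vpc --allow=tcp:22 --source-ranges=0.0.0.0/0
      # step 2: bastion host with a valid service account and the cloud-platform scope
      gcloud compute instances create test-bastion \
          --zone=us-central1-a --subnet=test-master-subnet \
          --image-family=rhel-8 --image-project=rhel-cloud \
          --service-account=<sa-name>@<project-id>.iam.gserviceaccount.com \
          --scopes=https://www.googleapis.com/auth/cloud-platform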
      

      Actual results:

      Bootstrap failed to complete, and all compute machines failed on first boot because 'GET https://api-int.jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com:22623/config/worker' kept failing.
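
      A quick way to confirm this from the bastion host (a hypothetical check, not part of the original steps; the Accept header mirrors what Ignition sends):

      curl -ks -o /dev/null -w '%{http_code}\n' \
          -H 'Accept: application/vnd.coreos.ignition+json;version=3.2.0' \
          https://api-int.jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com:22623/config/worker
      # a 500 here corresponds to the "GET result: Internal Server Error" seen in the compute machines' serial logs (see Additional info)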

      Expected results:

      Installation should succeed.

      Additional info:

      1. One compute machine serial log: 
      [***   ] A start job is running for Ignition (fetch) (20min 7s / no limit)[ 1211.424359] ignition[909]: GET https://api-int.jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com:22623/config/worker: attempt #245
      [ 1211.437213] ignition[909]: GET result: Internal Server Error
      
      2. After explicitly removing the bootstrap instance from the load balancers, the compute nodes turned Ready, but some cluster operators could not become Available (see below; a gcloud sketch of the removal follows item 3).
      [cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./oc get nodes
      NAME                                                              STATUS   ROLES                  AGE   VERSION
      jiwei-0930-03-rrhmn-master-0.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   94m   v1.24.0+8c7c967
      jiwei-0930-03-rrhmn-master-1.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   95m   v1.24.0+8c7c967
      jiwei-0930-03-rrhmn-master-2.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   95m   v1.24.0+8c7c967
      jiwei-0930-03-rrhmn-worker-a-4b5n4                                Ready    worker                 14m   v1.24.0+8c7c967
      jiwei-0930-03-rrhmn-worker-b-bjzkw                                Ready    worker                 14m   v1.24.0+8c7c967
      [cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./oc get clusteroperator | grep -v "True        False         False"
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-09-28-204419   False       True          True       92m     WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.6:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
      console                                    4.12.0-0.nightly-2022-09-28-204419   False       False         True       7m26s   RouteHealthAvailable: console route is not admitted
      ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         True       13m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
      kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         True       89m     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
      monitoring                                                                      False       False         True       76m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
      [cloud-user@jiwei-0930-02-rhel8-mirror ~]$ 
      
      3. Please see http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/CORS-2260/ for must-gather and bootstrap logs, and the sample "install-config.yaml".
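
      For reference, removing the bootstrap instance from the load balancers (item 2 above) can look roughly like the following; the instance-group, target-pool, and instance names and the zone are placeholders that depend on the cluster's infra ID and load-balancer topology, and normally "openshift-install destroy bootstrap" performs this cleanup:

      # drop bootstrap from the unmanaged instance group backing the internal API load balancer
      gcloud compute instance-groups unmanaged remove-instances <infra-id>-bootstrap-ig \
          --zone=<zone> --instances=<infra-id>-bootstrap
      # drop bootstrap from the target pool backing the external API load balancer
      gcloud compute target-pools remove-instances <infra-id>-api-target-pool \
          --instances-zone=<zone> --instances=<infra-id>-bootstrap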
      

       

       
