Resolution: Not a Bug
Description of problem:
"Bootstrap failed to complete" and compute machines failed on first-boot
Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install 4.12.0-0.nightly-2022-09-28-204419
built from commit 9eb0224926982cdd6cae53b872326292133e532d
release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
release architecture amd64
How reproducible:
Always so far: I tried twice and hit the same issue both times.
Steps to Reproduce:
1. Create a VPC network, subnets, and a firewall rule to allow SSH access to the bastion host.
2. Create the bastion host with a valid service account and the scope "https://www.googleapis.com/auth/cloud-platform".
3. scp the pull secret to the bastion host.
4. ssh to the bastion host (subsequent steps run on the bastion host unless stated otherwise).
5. Get "oc", e.g.:
   curl https://mirror2.openshift.com/pub/openshift-v4/clients/ocp/4.9.9/openshift-client-linux-4.9.9.tar.gz -o openshift-client-linux-4.9.9.tar.gz
   tar zxvf openshift-client-linux-4.9.9.tar.gz
6. Obtain the installation program.
7. Prepare a valid "install-config.yaml" (as a workaround for OCPBUGS-1896).
8. See the attached "create-cluster" for the installation steps and errors.
Actual results:
Bootstrap failed, and all compute machines failed first boot because Ignition could not complete 'GET https://api-int.jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com:22623/config/worker'.
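The failing request can also be probed by hand from the bastion host. The following is a minimal sketch and not part of the original reproduction; the helper names `mcs_url`, `classify_status`, and `probe` are hypothetical, certificate verification is disabled on the assumption that the bastion does not trust the cluster's internal CA, and the `Accept` header value is an assumption about what the machine config server expects from an Ignition client.

```python
import ssl
import urllib.error
import urllib.request

MCS_PORT = 22623  # port taken from the failing URL in this report

# Assumption: an Ignition-style Accept header; adjust the spec version as needed.
DEFAULT_ACCEPT = "application/vnd.coreos.ignition+json;version=3.2.0"


def mcs_url(cluster_domain: str, role: str = "worker") -> str:
    """Build the config URL that Ignition fetches for a given machine role."""
    return f"https://api-int.{cluster_domain}:{MCS_PORT}/config/{role}"


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse outcome."""
    if 200 <= status < 300:
        return "ok"
    if 500 <= status < 600:
        return "server-error"  # e.g. the 'Internal Server Error' in the serial log
    return "unexpected"


def probe(url: str, accept: str = DEFAULT_ACCEPT, timeout: float = 10.0) -> int:
    """Issue one GET and return the HTTP status code.

    Connection-level failures (DNS, refused, timeout) are not caught here
    and propagate as urllib.error.URLError or OSError.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # assumption: cluster CA is not trusted locally
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(url, headers={"Accept": accept})
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what matters here.
        return err.code
```

For this cluster, `classify_status(probe(mcs_url("jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com")))` would be expected to report a server error for as long as the load balancer routes the request to a backend that cannot serve the worker config.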
Expected results:
Installation should succeed.
Additional info:
1. One compute machine serial log:
   [*** ] A start job is running for Ignition (fetch) (20min 7s / no limit)
   [ 1211.424359] ignition[909]: GET https://api-int.jiwei-0930-03.qe-shared-vpc.qe.gcp.devcluster.openshift.com:22623/config/worker: attempt #245
   [ 1211.437213] ignition[909]: GET result: Internal Server Error
2. After explicitly removing bootstrap from the load balancers, the compute nodes turned Ready, but some cluster operators did not become available (see below).
   [cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./oc get nodes
   NAME                                                              STATUS   ROLES                  AGE   VERSION
   jiwei-0930-03-rrhmn-master-0.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   94m   v1.24.0+8c7c967
   jiwei-0930-03-rrhmn-master-1.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   95m   v1.24.0+8c7c967
   jiwei-0930-03-rrhmn-master-2.c.openshift-qe-shared-vpc.internal   Ready    control-plane,master   95m   v1.24.0+8c7c967
   jiwei-0930-03-rrhmn-worker-a-4b5n4                                Ready    worker                 14m   v1.24.0+8c7c967
   jiwei-0930-03-rrhmn-worker-b-bjzkw                                Ready    worker                 14m   v1.24.0+8c7c967
   [cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./oc get clusteroperator | grep -v "True False False"
   NAME                      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
   authentication            4.12.0-0.nightly-2022-09-28-204419   False       True          True       92m     WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
   console                   4.12.0-0.nightly-2022-09-28-204419   False       False         True       7m26s   RouteHealthAvailable: console route is not admitted
   ingress                   4.12.0-0.nightly-2022-09-28-204419   True        False         True       13m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
   kube-controller-manager   4.12.0-0.nightly-2022-09-28-204419   True        False         True       89m     GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on no such host
   monitoring                                                     False       False         True       76m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
   [cloud-user@jiwei-0930-02-rhel8-mirror ~]$
3. Please see http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/CORS-2260/ for must-gather and bootstrap logs, and the sample "install-config.yaml".
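The `grep -v "True False False"` filter used in item 2 depends on the column spacing of the text output. A sketch of the same check against the JSON form (`./oc get clusteroperator -o json`) is given here, assuming each item carries the usual `status.conditions` list with `Available`, `Progressing`, and `Degraded` entries; the function names `unhealthy_operators` and `unhealthy_from_json` are hypothetical.

```python
import json

# The state that the grep in this report treats as healthy.
HEALTHY = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def unhealthy_operators(doc: dict) -> list:
    """Return (name, available, progressing, degraded) for each operator
    whose three conditions differ from HEALTHY.

    A condition that is absent from the operator's status is reported as
    "Unknown", so such an operator is listed as unhealthy.
    """
    result = []
    for item in doc.get("items", []):
        conditions = {
            cond["type"]: cond["status"]
            for cond in item.get("status", {}).get("conditions", [])
        }
        state = {kind: conditions.get(kind, "Unknown") for kind in HEALTHY}
        if state != HEALTHY:
            result.append(
                (
                    item["metadata"]["name"],
                    state["Available"],
                    state["Progressing"],
                    state["Degraded"],
                )
            )
    return result


def unhealthy_from_json(text: str) -> list:
    """Convenience wrapper for the raw JSON text printed by the CLI."""
    return unhealthy_operators(json.loads(text))
```

Passing the saved output of `./oc get clusteroperator -o json` to `unhealthy_from_json` yields one tuple per operator that is not fully available, regardless of whether its VERSION field is populated (as with `monitoring` above).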
- is related to CORS-2286 QE Tracker (Closed)