-
Bug
-
Resolution: Done
-
Critical
-
4.13.0
-
Critical
-
No
-
Sprint 232, Sprint 233
-
2
-
Proposed
-
False
-
This is a clone of issue OCPBUGS-6777. The following is the description of the original issue:
—
Description of problem:
"create manifests" without an existing "install-config.yaml" missing 4 YAML files in "<install dir>/openshift" which leads to "create cluster" failure
Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. "create manifests" 2. "create cluster"
Actual results:
1. After "create manifests", in "<install dir>/openshift", there're 4 YAML files missing, including "99_cloud-creds-secret.yaml", "99_kubeadmin-password-secret.yaml", "99_role-cloud-creds-secret-reader.yaml", and "openshift-install-manifests.yaml", comparing with "create manifests" with an existing "install-config.yaml". 2. The installation failed without any worker nodes due to error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api".
Expected results:
1. "create manifests" without an existing "install-config.yaml" should generate the same set of YAML files as "create manifests" with an existing "install-config.yaml". 2. Then the subsequent "create cluster" should succeed.
Additional info:
The working scenario: "create manifests" with an existing "install-config.yaml"

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
$
$ mkdir test30
$ cp install-config.yaml test30
$ yq-3.3.0 r test30/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ yq-3.3.0 r test30/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0130a
$ ./openshift-install create manifests --dir test30
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Consuming Install Config from target directory
WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated
INFO Manifests created in: test30/manifests and test30/openshift
$
$ tree test30
test30
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 31 files
$

The problem scenario: "create manifests" without an existing "install-config.yaml", and then "create cluster"

$ ./openshift-install create manifests --dir test31
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-0130b
? Pull Secret [? for help] *******
INFO Manifests created in: test31/manifests and test31/openshift
$
$ tree test31
test31
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    └── 99_openshift-machineconfig_99-worker-ssh.yaml

2 directories, 27 files
$
$ ./openshift-install create cluster --dir test31
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:17PM) for the Kubernetes API at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.25.2+7dab57f up
INFO Waiting up to 30m0s (until 4:28PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:59PM) for the cluster at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_ResourceNotFound::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found
ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 7 of 7 credentials requests are failing to sync.
INFO Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 7 credentials requests provisioned, 7 reporting errors.
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted
ERROR Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
ERROR Cluster operator control-plane-machine-set Degraded is True with NoReadyMachines: No ready control plane machines found
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist
ERROR NodeCADaemonAvailable: The daemon set node-ca has available replicas
ERROR ImagePrunerAvailable: Pruner CronJob has been created
INFO Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed
ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DNSReady=False (NoZones: The record isn't present in any zones.)
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
INFO ).
INFO Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-c68b5786c-prk7x" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Pod "router-default-c68b5786c-ssrv7" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DNSReady=False (NoZones: The record isn't present in any zones.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107 because minimum worker replica count (2) not yet met: current running replicas 0, waiting for [jiwei-0130b-25fcm-worker-a-j6t42 jiwei-0130b-25fcm-worker-b-dpw9b jiwei-0130b-25fcm-worker-c-9cdms]
ERROR Cluster operator machine-api Available is False with Initializing: Operator is initializing
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
ERROR Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring, storage are not available
$ export KUBECONFIG=test31/auth/kubeconfig
$ ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          74m     Unable to apply 4.13.0-0.nightly-2023-01-27-165107: some cluster operators are not available
$ ./oc get nodes
NAME                                                 STATUS   ROLES                  AGE   VERSION
jiwei-0130b-25fcm-master-0.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-1.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-2.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
$ ./oc get machines -n openshift-machine-api
NAME                               PHASE   TYPE   REGION   ZONE   AGE
jiwei-0130b-25fcm-master-0                                        73m
jiwei-0130b-25fcm-master-1                                        73m
jiwei-0130b-25fcm-master-2                                        73m
jiwei-0130b-25fcm-worker-a-j6t42                                  65m
jiwei-0130b-25fcm-worker-b-dpw9b                                  65m
jiwei-0130b-25fcm-worker-c-9cdms                                  65m
$ ./oc get controlplanemachinesets -n openshift-machine-api
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3                               Active   74m
$

Please see the attached ".openshift_install.log", the "install-config.yaml" snippet, and the outputs of more "oc" commands.
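For completeness, the missing worker nodes above can be correlated with the absent credentials manifest; a hedged check against the failing cluster (a standard "oc" lookup of the secret named in the machine-api error, nothing else assumed):

$ ./oc get secret gcp-cloud-credentials -n openshift-machine-api
# in the failing scenario this is expected to return NotFound, matching the
# "error getting credentials secret" reported for openshift-machine-api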
- clones
-
OCPBUGS-6777 [gcp][CORS-1988] "create manifests" without an existing "install-config.yaml" missing 4 YAML files in "<install dir>/openshift" which leads to "create cluster" failure
- Closed
- is blocked by
-
OCPBUGS-6777 [gcp][CORS-1988] "create manifests" without an existing "install-config.yaml" missing 4 YAML files in "<install dir>/openshift" which leads to "create cluster" failure
- Closed
- links to