OpenShift Bugs / OCPBUGS-7481

[gcp][CORS-1988] "create manifests" without an existing "install-config.yaml" leaves 4 YAML files missing from "<install dir>/openshift", which leads to "create cluster" failure


Details

    • Priority: Critical
    • Sprint 232, Sprint 233

    Description

      This is a clone of issue OCPBUGS-6777. The following is the description of the original issue:

      Description of problem:

      Running "create manifests" without an existing "install-config.yaml" leaves 4 YAML files missing from "<install dir>/openshift", which leads to a subsequent "create cluster" failure.

      Version-Release number of selected component (if applicable):

      $ ./openshift-install version
      ./openshift-install 4.13.0-0.nightly-2023-01-27-165107
      built from commit fca41376abe654a9124f0450727579bb85591438
      release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
      release architecture amd64
      

      How reproducible:

      Always

      Steps to Reproduce:

      1. Run "create manifests" without an existing "install-config.yaml", answering the interactive prompts
      2. Run "create cluster"
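The failure can be caught before an install attempt is wasted. A minimal pre-flight sketch (the directory name `test31` follows the transcript below and is otherwise arbitrary):

```shell
# Pre-flight check: before running "create cluster", verify that the
# credential manifests exist under <install dir>/openshift.
# "test31" matches the transcript below; adjust for your install dir.
dir=test31
for f in 99_cloud-creds-secret.yaml 99_role-cloud-creds-secret-reader.yaml; do
  [ -f "$dir/openshift/$f" ] || echo "missing: $dir/openshift/$f"
done
```

On an affected run both files are reported missing; per the working scenario below, "create manifests" with an install-config.yaml present generates them.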

      Actual results:

      1. After "create manifests", "<install dir>/openshift" is missing 4 YAML files compared with running "create manifests" against an existing "install-config.yaml": "99_cloud-creds-secret.yaml", "99_kubeadmin-password-secret.yaml", "99_role-cloud-creds-secret-reader.yaml", and "openshift-install-manifests.yaml".
      2. The installation then fails with no worker nodes, due to an error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api".
      

      Expected results:

      1. "create manifests" without an existing "install-config.yaml" should generate the same set of YAML files as "create manifests" with an existing "install-config.yaml".
      2. Then the subsequent "create cluster" should succeed.
      

      Additional info:

      The working scenario: "create manifests" with an existing "install-config.yaml"
      
      $ ./openshift-install version
      ./openshift-install 4.13.0-0.nightly-2023-01-27-165107
      built from commit fca41376abe654a9124f0450727579bb85591438
      release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
      release architecture amd64
      $ 
      $ mkdir test30
      $ cp install-config.yaml test30
      $ yq-3.3.0 r test30/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
      $ yq-3.3.0 r test30/install-config.yaml metadata
      creationTimestamp: null
      name: jiwei-0130a
      $ ./openshift-install create manifests --dir test30
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
      INFO Consuming Install Config from target directory 
      WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated 
      INFO Manifests created in: test30/manifests and test30/openshift 
      $ 
      $ tree test30
      test30
      ├── manifests
      │   ├── cloud-controller-uid-config.yml
      │   ├── cloud-provider-config.yaml
      │   ├── cluster-config.yaml
      │   ├── cluster-dns-02-config.yml
      │   ├── cluster-infrastructure-02-config.yml
      │   ├── cluster-ingress-02-config.yml
      │   ├── cluster-network-01-crd.yml
      │   ├── cluster-network-02-config.yml
      │   ├── cluster-proxy-01-config.yaml
      │   ├── cluster-scheduler-02-config.yml
      │   ├── cvo-overrides.yaml
      │   ├── kube-cloud-config.yaml  
      │   ├── kube-system-configmap-root-ca.yaml
      │   ├── machine-config-server-tls-secret.yaml
      │   └── openshift-config-secret-pull-secret.yaml
      └── openshift
          ├── 99_cloud-creds-secret.yaml
          ├── 99_kubeadmin-password-secret.yaml
          ├── 99_openshift-cluster-api_master-machines-0.yaml
          ├── 99_openshift-cluster-api_master-machines-1.yaml
          ├── 99_openshift-cluster-api_master-machines-2.yaml
          ├── 99_openshift-cluster-api_master-user-data-secret.yaml
          ├── 99_openshift-cluster-api_worker-machineset-0.yaml
          ├── 99_openshift-cluster-api_worker-machineset-1.yaml
          ├── 99_openshift-cluster-api_worker-machineset-2.yaml
          ├── 99_openshift-cluster-api_worker-machineset-3.yaml
          ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
          ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
          ├── 99_openshift-machineconfig_99-master-ssh.yaml
          ├── 99_openshift-machineconfig_99-worker-ssh.yaml
          ├── 99_role-cloud-creds-secret-reader.yaml
          └── openshift-install-manifests.yaml

      2 directories, 31 files
      $ 
      
      The problem scenario: "create manifests" without an existing "install-config.yaml", and then "create cluster"
      
      $ ./openshift-install create manifests --dir test31
      ? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
      ? Platform gcp
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      ? Project ID OpenShift QE (openshift-qe)
      ? Region us-central1
      ? Base Domain qe.gcp.devcluster.openshift.com
      ? Cluster Name jiwei-0130b
      ? Pull Secret [? for help] *******
      INFO Manifests created in: test31/manifests and test31/openshift
      $ 
      $ tree test31
      test31
      ├── manifests
      │   ├── cloud-controller-uid-config.yml
      │   ├── cloud-provider-config.yaml
      │   ├── cluster-config.yaml
      │   ├── cluster-dns-02-config.yml
      │   ├── cluster-infrastructure-02-config.yml
      │   ├── cluster-ingress-02-config.yml
      │   ├── cluster-network-01-crd.yml
      │   ├── cluster-network-02-config.yml
      │   ├── cluster-proxy-01-config.yaml
      │   ├── cluster-scheduler-02-config.yml
      │   ├── cvo-overrides.yaml
      │   ├── kube-cloud-config.yaml
      │   ├── kube-system-configmap-root-ca.yaml
      │   ├── machine-config-server-tls-secret.yaml
      │   └── openshift-config-secret-pull-secret.yaml
      └── openshift
          ├── 99_openshift-cluster-api_master-machines-0.yaml
          ├── 99_openshift-cluster-api_master-machines-1.yaml
          ├── 99_openshift-cluster-api_master-machines-2.yaml
          ├── 99_openshift-cluster-api_master-user-data-secret.yaml
          ├── 99_openshift-cluster-api_worker-machineset-0.yaml
          ├── 99_openshift-cluster-api_worker-machineset-1.yaml
          ├── 99_openshift-cluster-api_worker-machineset-2.yaml
          ├── 99_openshift-cluster-api_worker-machineset-3.yaml
          ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
          ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
          ├── 99_openshift-machineconfig_99-master-ssh.yaml
          └── 99_openshift-machineconfig_99-worker-ssh.yaml

      2 directories, 27 files
      $ 
      $ ./openshift-install create cluster --dir test31
      INFO Consuming Common Manifests from target directory
      INFO Consuming Openshift Manifests from target directory
      INFO Consuming Master Machines from target directory
      INFO Consuming Worker Machines from target directory
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      INFO Creating infrastructure resources...
      INFO Waiting up to 20m0s (until 4:17PM) for the Kubernetes API at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443...
      INFO API v1.25.2+7dab57f up
      INFO Waiting up to 30m0s (until 4:28PM) for bootstrapping to complete...
      INFO Destroying the bootstrap resources...
      INFO Waiting up to 40m0s (until 4:59PM) for the cluster at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443 to initialize...
      ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
      ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
      ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
      ERROR OAuthServerDeploymentDegraded:
      ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
      ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
      ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
      ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
      ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_ResourceNotFound::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found
      ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
      ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
      ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
      ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
      INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
      ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 7 of 7 credentials requests are failing to sync.
      INFO Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 7 credentials requests provisioned, 7 reporting errors.
      ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
      ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
      ERROR RouteHealthDegraded: console route is not admitted
      ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
      ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted 
      ERROR Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
      ERROR Cluster operator control-plane-machine-set Degraded is True with NoReadyMachines: No ready control plane machines found
      INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      ERROR Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist
      ERROR NodeCADaemonAvailable: The daemon set node-ca has available replicas
      ERROR ImagePrunerAvailable: Pruner CronJob has been created
      INFO Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found
      INFO NodeCADaemonProgressing: The daemon set node-ca is deployed
      ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
      ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DNSReady=False (NoZones: The record isn't present in any zones.)
      INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
      INFO ).
      INFO Not all ingress controllers are available.
      ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-c68b5786c-prk7x" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Pod "router-default-c68b5786c-ssrv7" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DNSReady=False (NoZones: The record isn't present in any zones.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
      INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
      INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
      INFO Cluster operator insights Disabled is False with AsExpected:
      INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
      ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host  
      INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107
      ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107 because minimum worker replica count (2) not yet met: current running replicas 0, waiting for [jiwei-0130b-25fcm-worker-a-j6t42 jiwei-0130b-25fcm-worker-b-dpw9b jiwei-0130b-25fcm-worker-c-9cdms]
      ERROR Cluster operator machine-api Available is False with Initializing: Operator is initializing
      ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
      ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
      INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
      INFO Cluster operator network ManagementStateDegraded is False with :
      INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
      INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
      INFO Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
      ERROR Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment
      ERROR Cluster initialization failed because one or more operators are not functioning properly.
      ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
      ERROR failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring, storage are not available
      $ export KUBECONFIG=test31/auth/kubeconfig 
      $ ./oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          74m     Unable to apply 4.13.0-0.nightly-2023-01-27-165107: some cluster operators are not available
      $ ./oc get nodes
      NAME                                                 STATUS   ROLES                  AGE   VERSION
      jiwei-0130b-25fcm-master-0.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
      jiwei-0130b-25fcm-master-1.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
      jiwei-0130b-25fcm-master-2.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
      $ ./oc get machines -n openshift-machine-api
      NAME                               PHASE   TYPE   REGION   ZONE   AGE
      jiwei-0130b-25fcm-master-0                                        73m
      jiwei-0130b-25fcm-master-1                                        73m
      jiwei-0130b-25fcm-master-2                                        73m
      jiwei-0130b-25fcm-worker-a-j6t42                                  65m
      jiwei-0130b-25fcm-worker-b-dpw9b                                  65m
      jiwei-0130b-25fcm-worker-c-9cdms                                  65m
      $ ./oc get controlplanemachinesets -n openshift-machine-api
      NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
      cluster   3         3                           3             Active   74m
      $ 
      
      Please see the attached ".openshift_install.log", install-config.yaml snippet, and more "oc" commands outputs.
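
      The delta between the two "tree" listings can also be computed mechanically with standard tools. A sketch that hard-codes the file names captured in the transcripts above:

```shell
# Both runs emit these 12 files in <dir>/openshift (from the tree outputs above):
common="99_openshift-cluster-api_master-machines-0.yaml
99_openshift-cluster-api_master-machines-1.yaml
99_openshift-cluster-api_master-machines-2.yaml
99_openshift-cluster-api_master-user-data-secret.yaml
99_openshift-cluster-api_worker-machineset-0.yaml
99_openshift-cluster-api_worker-machineset-1.yaml
99_openshift-cluster-api_worker-machineset-2.yaml
99_openshift-cluster-api_worker-machineset-3.yaml
99_openshift-cluster-api_worker-user-data-secret.yaml
99_openshift-machine-api_master-control-plane-machine-set.yaml
99_openshift-machineconfig_99-master-ssh.yaml
99_openshift-machineconfig_99-worker-ssh.yaml"
# The with-install-config run (test30) additionally emits these 4:
extra="99_cloud-creds-secret.yaml
99_kubeadmin-password-secret.yaml
99_role-cloud-creds-secret-reader.yaml
openshift-install-manifests.yaml"
printf '%s\n%s\n' "$common" "$extra" | LC_ALL=C sort > with-config.txt
printf '%s\n'     "$common"          | LC_ALL=C sort > without-config.txt
# comm -23 prints lines present only in the first file, i.e. the missing manifests:
LC_ALL=C comm -23 with-config.txt without-config.txt
```

The four lines printed are exactly the manifests absent from test31/openshift in the problem scenario.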
      

            People

              rna-afk Aditya Narayanaswamy
              openshift-crt-jira-prow OpenShift Prow Bot
              Jianli Wei Jianli Wei
              Votes: 0
              Watchers: 6
