OpenShift Installer / CORS-2366

Ensure XPN works with Private Clusters


    • Type: Story
    • Resolution: Done
    • Priority: Normal
    • Installer Core
    • Sprint 230
    • Proposed

      User Story:

      I want to install a private cluster through GCP XPN IPI so that no public endpoints are exposed when I use my shared VPC.

       

       

      =================================================

      QE tested this in 4.12 and hit this issue:

      Description of problem:

      The ingress operator unexpectedly reports that a worker node should be in the control-plane subnet, so 'wait-for install-complete' fails.

      Version-Release number of selected component (if applicable):

      $ openshift-install version
      openshift-install 4.12.0-0.nightly-2022-10-15-094115
      built from commit c5d7528d759ea808dbd3291101ec40fd222e1273
      release image registry.ci.openshift.org/ocp/release@sha256:55d8660794fbf33031e83e3b3489dc3718290ccbba38ab056d02ffe4a25274c4
      release architecture amd64
      

      How reproducible:

      Always

      Steps to Reproduce:

      1. Attempt an IPI XPN installation with "publish" set to "Internal", i.e. deploy a private cluster.

      Actual results:

      $ oc get co ingress
      NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      ingress             False       True          True       12s     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1017-00-4cbln-worker-b-ddsqz' is expected to be in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' but is in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2'., wrongSubnetwork...
      $ 
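The error above names three resources: the instance, the subnetwork the load balancer expects, and the subnetwork the instance is actually in. A minimal sketch for pulling those names out of such a message follows. The function name parse_wrong_subnet is hypothetical, and the patterns assume the message wording shown in this report.

```shell
#!/usr/bin/env bash
# Sketch: extract the instance name and both subnetwork names from a
# "wrongSubnetwork" message. parse_wrong_subnet is a hypothetical helper,
# and the patterns assume the exact wording quoted in this report.
parse_wrong_subnet() {
  local msg="$1"
  local instance expected actual
  # Each sed expression matches one quoted resource path and keeps only
  # its final path segment.
  instance=$(printf '%s\n' "$msg" | sed -n "s|.*Resource '[^']*/instances/\([^']*\)'.*|\1|p")
  expected=$(printf '%s\n' "$msg" | sed -n "s|.*expected to be in the subnetwork '[^']*/subnetworks/\([^']*\)'.*|\1|p")
  actual=$(printf '%s\n' "$msg" | sed -n "s|.*but is in the subnetwork '[^']*/subnetworks/\([^']*\)'.*|\1|p")
  printf 'instance=%s expected=%s actual=%s\n' "$instance" "$expected" "$actual"
}
```

Passing the message from the ingress operator to parse_wrong_subnet prints one line with the worker name, the control-plane subnet (expected) and the compute subnet (actual).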
      

      Expected results:

      There should be no such error and the installation should succeed.

      Additional info:

      1. The Google Cloud credential has sufficient permissions:
      $ gcloud config get account
      jiwei@redhat.com
      $ gcloud config get project
      openshift-qe
      $ gcloud projects get-iam-policy openshift-qe --flatten="bindings[].members" --format="table(bindings.role)" --filter="bindings.members:ipi-xpn-min-permissions@openshift-qe.iam.gserviceaccount.com"
      ROLE
      roles/compute.admin
      roles/compute.instanceAdmin.v1
      roles/compute.loadBalancerAdmin
      roles/compute.storageAdmin
      roles/dns.admin
      roles/iam.roleViewer
      roles/iam.securityAdmin
      roles/iam.securityReviewer
      roles/iam.serviceAccountAdmin
      roles/iam.serviceAccountKeyAdmin
      roles/iam.serviceAccountUser
      roles/storage.admin
      $ gcloud projects get-iam-policy openshift-qe-shared-vpc --flatten="bindings[].members" --format="table(bindings.role)" --filter="bindings.members:ipi-xpn-min-permissions@openshift-qe.iam.gserviceaccount.com"
      ROLE
      projects/openshift-qe-shared-vpc/roles/dns.networks.bindPrivateDNSZone
      roles/compute.networkUser
      $ 
      
      2. The install-config snippets:
      $ gcloud config get account
      ipi-xpn-min-permissions@openshift-qe.iam.gserviceaccount.com
      $ gcloud config get project
      openshift-qe
      $ 
      $ mkdir work11
      $ cp install-config.yaml work11
      $ yq-3.3.0 r work11/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
        computeSubnet: installer-shared-vpc-subnet-2
        controlPlaneSubnet: installer-shared-vpc-subnet-1
        createFirewallRules: Disabled
        network: installer-shared-vpc
        networkProjectID: openshift-qe-shared-vpc
      $ yq-3.3.0 r work11/install-config.yaml publish
      Internal
      $ yq-3.3.0 r work11/install-config.yaml featureSet
      TechPreviewNoUpgrade
      $ yq-3.3.0 r work11/install-config.yaml compute
      - architecture: amd64
        hyperthreading: Enabled
        name: worker
        platform:
          gcp:
            tags:
            - preserved-ipi-xpn-compute
        replicas: 2
      $ yq-3.3.0 r work11/install-config.yaml controlPlane
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform:
        gcp:
          tags:
          - preserved-ipi-xpn-control-plane
      replicas: 3
      $ 
      $ export http_proxy=http://<username>:<password>@<bastion public ip>:3128
      $ export https_proxy=http://<username>:<password>@<bastion public ip>:3128
      $ 
      
      3. Try "create cluster" with the above install-config:
      $ openshift-install create cluster --dir work11
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      INFO Consuming Install Config from target directory
      WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
      INFO Creating infrastructure resources...
      INFO Waiting up to 20m0s (until 11:31AM) for the Kubernetes API at https://api.jiwei-1017-00.qe.gcp.devcluster.openshift.com:6443...
      INFO API v1.25.2+5bf2e1f up
      INFO Waiting up to 30m0s (until 11:44AM) for bootstrapping to complete...
      INFO Destroying the bootstrap resources...
      INFO Waiting up to 40m0s (until 12:15PM) for the cluster at https://api.jiwei-1017-00.qe.gcp.devcluster.openshift.com:6443 to initialize...
      ERROR Cluster operator authentication Degraded is True with OAuthServerRouteEndpointAccessibleController_SyncError: OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      ERROR Cluster operator authentication Available is False with OAuthServerRouteEndpointAccessibleController_EndpointUnavailable: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudControllerOwner is True with AsExpected: Cluster Cloud Controller Manager Operator owns cloud controllers at 4.12.0-0.nightly-2022-10-15-094115
      INFO Cluster operator cluster-api SecretSyncControllerAvailable is True with AsExpected: User Data Secret Controller works as expected
      INFO Cluster operator cluster-api SecretSyncControllerDegraded is False with AsExpected: User Data Secret Controller works as expected
      INFO Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.12.0-0.nightly-2022-10-15-094115, 0 replicas available
      ERROR Cluster operator console Available is False with Deployment_InsufficientReplicas::RouteHealth_FailedGet: DeploymentAvailable: 0 replicas available for console deployment
      ERROR RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com): Get "https://console-openshift-console.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host
      INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1017-00-4cbln-worker-b-ddsqz' is expected to be in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' but is in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2'., wrongSubnetwork
      ERROR The kube-controller-manager logs may contain more details.)
      INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
      ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
      INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
      INFO Cluster operator insights Disabled is False with AsExpected:
      INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
      INFO Cluster operator network ManagementStateDegraded is False with :
      ERROR Cluster initialization failed because one or more operators are not functioning properly.
      ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
      ERROR failed to initialize the cluster: Cluster operators authentication, console, ingress are not available
      $ 
      $ export KUBECONFIG=work11/auth/kubeconfig
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          63m     Unable to apply 4.12.0-0.nightly-2022-10-15-094115: some cluster operators are not available
      $ oc get nodes
      NAME                                                         STATUS   ROLES                  AGE   VERSION
      jiwei-1017-00-4cbln-master-0.c.openshift-qe.internal         Ready    control-plane,master   62m   v1.25.2+5bf2e1f
      jiwei-1017-00-4cbln-master-1.c.openshift-qe.internal         Ready    control-plane,master   63m   v1.25.2+5bf2e1f
      jiwei-1017-00-4cbln-master-2.c.openshift-qe.internal         Ready    control-plane,master   63m   v1.25.2+5bf2e1f
      jiwei-1017-00-4cbln-worker-a-skr9w.c.openshift-qe.internal   Ready    worker                 42m   v1.25.2+5bf2e1f
      jiwei-1017-00-4cbln-worker-b-ddsqz.c.openshift-qe.internal   Ready    worker                 42m   v1.25.2+5bf2e1f
      $ oc get co | grep -v "True        False         False"
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-10-15-094115   False       False         True       58m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1017-00.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      console                                    4.12.0-0.nightly-2022-10-15-094115   False       True          False      43m     DeploymentAvailable: 0 replicas available for console deployment...
      ingress                                                                         False       True          True       13s     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1017-00-4cbln-worker-b-ddsqz' is expected to be in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' but is in the subnetwork 'projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2'., wrongSubnetwork...
      $ 
      
      4. Some related GCP resources:
      $ gcloud compute instances list --filter='name~jiwei-1017-00'
      NAME                                ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP   STATUS
      jiwei-1017-00-4cbln-master-0        us-central1-a  n2-standard-4               10.0.0.54                  RUNNING
      jiwei-1017-00-4cbln-worker-a-skr9w  us-central1-a  n2-standard-4               10.0.32.56                 RUNNING
      jiwei-1017-00-rhel8-bastion         us-central1-a  n1-standard-1               10.0.0.40    35.223.25.38  RUNNING
      jiwei-1017-00-4cbln-master-1        us-central1-b  n2-standard-4               10.0.0.55                  RUNNING
      jiwei-1017-00-4cbln-worker-b-ddsqz  us-central1-b  n2-standard-4               10.0.32.57                 RUNNING
      jiwei-1017-00-4cbln-master-2        us-central1-c  n2-standard-4               10.0.0.53                  RUNNING
      $ gcloud compute instances describe jiwei-1017-00-4cbln-master-0 --zone us-central1-a --format json | jq -r .networkInterfaces[0].subnetwork
      https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1
      $ gcloud compute instances describe jiwei-1017-00-4cbln-master-1 --zone us-central1-b --format json | jq -r .networkInterfaces[0].subnetwork
      https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1
      $ gcloud compute instances describe jiwei-1017-00-4cbln-master-2 --zone us-central1-c --format json | jq -r .networkInterfaces[0].subnetwork
      https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1
      $ gcloud compute instances describe jiwei-1017-00-4cbln-worker-a-skr9w --zone us-central1-a --format json | jq -r .networkInterfaces[0].subnetwork
      https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2
      $ gcloud compute instances describe jiwei-1017-00-4cbln-worker-b-ddsqz --zone us-central1-b --format json | jq -r .networkInterfaces[0].subnetwork
      https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2
      $ 
      $ gcloud --project openshift-qe-shared-vpc compute networks subnets describe installer-shared-vpc-subnet-1 --region us-central1
      creationTimestamp: '2022-04-11T06:01:14.454-07:00'
      fingerprint: _iSHt-Vys1Y=
      gatewayAddress: 10.0.0.1
      id: '6501392071863670901'
      ipCidrRange: 10.0.0.0/19
      kind: compute#subnetwork
      name: installer-shared-vpc-subnet-1
      network: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
      privateIpGoogleAccess: false
      privateIpv6GoogleAccess: DISABLE_GOOGLE_ACCESS
      purpose: PRIVATE
      region: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1
      selfLink: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1
      stackType: IPV4_ONLY
      $ 
      $ gcloud --project openshift-qe-shared-vpc compute networks subnets describe installer-shared-vpc-subnet-2 --region us-central1
      creationTimestamp: '2022-04-11T06:01:13.334-07:00'
      fingerprint: yC6-bWh3OGc=
      gatewayAddress: 10.0.32.1
      id: '3231194790049192054'
      ipCidrRange: 10.0.32.0/19
      kind: compute#subnetwork
      name: installer-shared-vpc-subnet-2
      network: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
      privateIpGoogleAccess: false
      privateIpv6GoogleAccess: DISABLE_GOOGLE_ACCESS
      purpose: PRIVATE
      region: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1
      selfLink: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2
      stackType: IPV4_ONLY
      $ 
      $ gcloud --project openshift-qe-shared-vpc compute networks describe installer-shared-vpc
      autoCreateSubnetworks: false
      creationTimestamp: '2022-04-11T06:00:54.339-07:00'
      id: '3925591109463021673'
      kind: compute#network
      name: installer-shared-vpc
      networkFirewallPolicyEnforcementOrder: AFTER_CLASSIC_FIREWALL
      routingConfig:
        routingMode: REGIONAL
      selfLink: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
      selfLinkWithId: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/3925591109463021673
      subnetworks:
      - https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-2
      - https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1
      x_gcloud_bgp_routing_mode: REGIONAL
      x_gcloud_subnet_mode: CUSTOM
      $ 
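The instance IPs and subnet CIDR ranges listed above can be cross-checked without calling the GCP API. The following is a minimal sketch in pure bash. The helper names ip_to_int and ip_in_cidr are hypothetical and handle IPv4 only.

```shell
#!/usr/bin/env bash
# Sketch: test whether an IPv4 address falls inside a CIDR block.
# ip_to_int and ip_in_cidr are hypothetical helpers, not part of any tool.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 when address $1 lies inside CIDR block $2, non-zero otherwise.
ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

# Usage with values taken from this report: worker-b against both subnets.
for cidr in 10.0.0.0/19 10.0.32.0/19; do
  if ip_in_cidr 10.0.32.57 "$cidr"; then
    echo "10.0.32.57 is inside $cidr"
  else
    echo "10.0.32.57 is outside $cidr"
  fi
done
```

With the values from this report, the worker address 10.0.32.57 falls in 10.0.32.0/19 (installer-shared-vpc-subnet-2) and not in 10.0.0.0/19 (installer-shared-vpc-subnet-1), which matches the subnetworks returned by the describe commands.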
      
      5. The must-gather logs: http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-1017-00-4cbln/must-gather.local.2217384992678477281.tar
      

       

       

       

       

              rh-ee-bbarbach Brent Barbachem
              rhn-support-jiwei Jianli Wei
              Hongan Li Hongan Li