OpenShift Bugs / OCPBUGS-2877

[gcp][CORS-1774] with "credentialsMode: Manual", the ingress operator degraded with error "Required 'compute.firewalls.get' permission" unexpectedly



      Description of problem:

      1. The pre-created service account for the ingress operator had been granted "roles/compute.networkUser", which includes the permission "compute.firewalls.get", yet the operator still reports the error (see the check below).
      2. "createFirewallRules" is set to Disabled, so the installer does not create any firewall rules, and the rule "k8s-fw-..." mentioned in the error does not exist at all.
      

      Version-Release number of selected component (if applicable):

      $ openshift-install version
      openshift-install 4.12.0-0.nightly-2022-10-25-210451
      built from commit 14d496fdaec571fa97604a487f5df6a0433c0c68
      release image registry.ci.openshift.org/ocp/release@sha256:d6cc07402fee12197ca1a8592b5b781f9f9a84b55883f126d60a3896a36a9b74
      release architecture amd64
      

      How reproducible:

      Always

      Steps to Reproduce:

      1. Try an IPI installation into a shared VPC, with "credentialsMode" set to "Manual" (see the credentials sketch below).
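
      The actual reproduction below uses a local helper script (gcp_cco_helper.sh) to create the manual credentials; a roughly equivalent sketch using the documented ccoctl flow would look like the following (directory names are illustrative, and the output layout may differ from the helper script's):

      $ RELEASE_IMAGE=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451
      $ oc adm release extract --credentials-requests --cloud=gcp \
          --to=credreqs -a pull_secret.json "${RELEASE_IMAGE}"
      $ ccoctl gcp create-all --name=jiwei-1026a --region=us-central1 \
          --project=openshift-qe --credentials-requests-dir=credreqs \
          --output-dir=cco-manifests
      $ cp cco-manifests/manifests/* test4/manifests/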
      

      Actual results:

      Installation failed, and the ingress operator became degraded.
      $ oc get co ingress
      NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      ingress             False       True          True       49m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663', forbidden...
      $ 
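
      Consistent with item 2 of the description, the firewall rule named in the error can be looked up directly in the host project; a NOT_FOUND response would confirm it was never created:

      $ gcloud --project openshift-qe-shared-vpc compute firewall-rules describe k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663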
      

      Expected results:

      The installation should succeed even with CCO in manual mode (as described in https://github.com/openshift/openshift-docs/pull/51171).

      Additional info:

      1. The pre-configured DNS zones in the service project, and the firewall rules in the host project:
      $ gcloud dns managed-zones list --filter='name=qe1'
      NAME  DNS_NAME                           DESCRIPTION  VISIBILITY
      qe1   qe1.gcp.devcluster.openshift.com.               public
      $ gcloud dns managed-zones list --filter='name=ipi-xpn-private-zone'
      NAME                  DNS_NAME                                       DESCRIPTION                         VISIBILITY
      ipi-xpn-private-zone  jiwei-1026a.qe1.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
      $ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc AND NOT name~ci-op-xpn' 2> /dev/null
      NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                         DENY  DISABLED
      preserved-ipi-xpn-api               installer-shared-vpc  INGRESS    1000      tcp:6443,tcp:80,tcp:443                                                                                             False
      preserved-ipi-xpn-bastion-access    installer-shared-vpc  INGRESS    1000      tcp:22,tcp:3128-3129,tcp:5000,tcp:6001-6002,tcp:8080                                                                False
      preserved-ipi-xpn-control-plane     installer-shared-vpc  INGRESS    1000      tcp:22623,tcp:10257,tcp:10259                                                                                       False
      preserved-ipi-xpn-etcd              installer-shared-vpc  INGRESS    1000      tcp:2379-2380                                                                                                       False
      preserved-ipi-xpn-health-checks     installer-shared-vpc  INGRESS    1000      tcp:6080,tcp:6443,tcp:22624,tcp:30000-32767                                                                         False
      preserved-ipi-xpn-internal-cluster  installer-shared-vpc  INGRESS    1000      tcp:30000-32767,udp:30000-32767,tcp:9000-9999,udp:9000-9999,udp:4789,udp:6081,udp:500,udp:4500,tcp:10250,esp        False
      preserved-ipi-xpn-internal-network  installer-shared-vpc  INGRESS    1000      tcp:22,icmp                                                                                                         False
      $ gcloud iam roles describe roles/compute.networkUser | grep compute.firewalls.get
      - compute.firewalls.get
      $ 
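
      Note that the role containing compute.firewalls.get only helps if the credential the service-controller actually uses holds that role on the host project. A hedged way to probe this directly is to activate the generated key (path hypothetical, taken from the secret created in step 3 below) and retry the failing call; a 403 here would point at a missing host-project binding, while a NOT_FOUND would mean the permission itself is fine:

      $ gcloud auth activate-service-account --key-file=<ingress-operator-sa-key>.json
      $ gcloud --project openshift-qe-shared-vpc compute firewall-rules describe k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663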
      
      2. The install-config snippet:
      $ yq-3.3.0 r test4/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
        computeSubnet: installer-shared-vpc-subnet-2
        controlPlaneSubnet: installer-shared-vpc-subnet-1
        createFirewallRules: Disabled
        publicDNSZone:
          id: qe1
        privateDNSZone:
          id: ipi-xpn-private-zone
        network: installer-shared-vpc
        networkProjectID: openshift-qe-shared-vpc
      $ yq-3.3.0 r test4/install-config.yaml baseDomain
      qe1.gcp.devcluster.openshift.com
      $ yq-3.3.0 r test4/install-config.yaml credentialsMode
      Manual
      $ yq-3.3.0 r test4/install-config.yaml compute
      - architecture: amd64
        hyperthreading: Enabled
        name: worker
        platform:
          gcp:
            tags:
            - preserved-ipi-xpn-compute
        replicas: 2
      $ yq-3.3.0 r test4/install-config.yaml controlPlane
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform:
        gcp:
          tags:
          - preserved-ipi-xpn-control-plane
      replicas: 3
      $ yq-3.3.0 r test4/install-config.yaml metadata
      creationTimestamp: null
      name: jiwei-1026a
      $ openshift-install create manifests --dir test4
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      INFO Consuming Install Config from target directory
      INFO Manifests created in: test4/manifests and test4/openshift
      $ 
      
      3. Manually create the required credentials, then copy the generated manifests into the installation directory:
      $ ./gcp_cco_helper.sh registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 us-central1 test4 pull_secret.json 
      ......
      $ cp cco-manifests/* test4/manifests/
      $ ls test4/manifests/ -lrt
      total 100
      -rw-r-----. 1 fedora fedora  4345 Oct 26 12:32 openshift-config-secret-pull-secret.yaml
      -rw-r-----. 1 fedora fedora  4086 Oct 26 12:32 machine-config-server-tls-secret.yaml
      -rw-r-----. 1 fedora fedora  1304 Oct 26 12:32 kube-system-configmap-root-ca.yaml
      -rw-r-----. 1 fedora fedora   118 Oct 26 12:32 kube-cloud-config.yaml
      -rw-r-----. 1 fedora fedora   200 Oct 26 12:32 cvo-overrides.yaml
      -rw-r-----. 1 fedora fedora   171 Oct 26 12:32 cluster-scheduler-02-config.yml
      -rw-r-----. 1 fedora fedora   142 Oct 26 12:32 cluster-proxy-01-config.yaml
      -rw-r-----. 1 fedora fedora   273 Oct 26 12:32 cluster-network-02-config.yml
      -rw-r-----. 1 fedora fedora 10135 Oct 26 12:32 cluster-network-01-crd.yml
      -rw-r-----. 1 fedora fedora   248 Oct 26 12:32 cluster-ingress-02-config.yml
      -rw-r-----. 1 fedora fedora   644 Oct 26 12:32 cluster-infrastructure-02-config.yml
      -rw-r-----. 1 fedora fedora   216 Oct 26 12:32 cluster-dns-02-config.yml
      -rw-r-----. 1 fedora fedora  2314 Oct 26 12:32 cluster-config.yaml
      -rw-r-----. 1 fedora fedora   545 Oct 26 12:32 cloud-provider-config.yaml
      -rw-r-----. 1 fedora fedora   175 Oct 26 12:32 cloud-controller-uid-config.yml
      -rw-rw-r--. 1 fedora fedora  3270 Oct 26 12:48 99_openshift-machine-api_gcp-cloud-credentials-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3267 Oct 26 12:48 99_openshift-ingress-operator_cloud-credentials-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3283 Oct 26 12:48 99_openshift-image-registry_installer-cloud-credentials-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3277 Oct 26 12:48 99_openshift-cluster-csi-drivers_gcp-pd-cloud-credentials-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3286 Oct 26 12:48 99_openshift-cloud-network-config-controller_cloud-credentials-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3301 Oct 26 12:48 99_openshift-cloud-credential-operator_cloud-credential-operator-gcp-ro-creds-secret.yaml
      -rw-rw-r--. 1 fedora fedora  3291 Oct 26 12:48 99_openshift-cloud-controller-manager_gcp-ccm-cloud-credentials-secret.yaml
      $ 
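
      To confirm which service account the ingress operator will actually use, its e-mail can be decoded from the generated secret; a sketch assuming the conventional service_account.json data key (the exact field name may differ depending on how the helper script writes the secret):

      $ grep service_account.json test4/manifests/99_openshift-ingress-operator_cloud-credentials-secret.yaml | awk '{print $2}' | base64 -d | jq -r .client_email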
      
      4. Try creating the cluster, which eventually fails because the ingress operator is degraded:
      $ openshift-install create cluster --dir test4
      INFO Consuming Worker Machines from target directory
      INFO Consuming Master Machines from target directory
      INFO Consuming OpenShift Install (Manifests) from target directory
      INFO Consuming Openshift Manifests from target directory
      INFO Consuming Common Manifests from target directory
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
      WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
      INFO Creating infrastructure resources...
      INFO Waiting up to 20m0s (until 1:11PM) for the Kubernetes API at https://api.jiwei-1026a.qe1.gcp.devcluster.openshift.com:6443...
      INFO API v1.25.2+4bd0702 up
      INFO Waiting up to 30m0s (until 1:23PM) for bootstrapping to complete...
      INFO Destroying the bootstrap resources...
      INFO Waiting up to 40m0s (until 1:47PM) for the cluster at https://api.jiwei-1026a.qe1.gcp.devcluster.openshift.com:6443 to initialize...
      ERROR Cluster operator authentication Degraded is True with OAuthServerRouteEndpointAccessibleController_SyncError: OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      ERROR Cluster operator authentication Available is False with OAuthServerRouteEndpointAccessibleController_EndpointUnavailable: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudControllerOwner is True with AsExpected: Cluster Cloud Controller Manager Operator owns cloud controllers at 4.12.0-0.nightly-2022-10-25-210451
      INFO Cluster operator cluster-api SecretSyncControllerAvailable is True with AsExpected: User Data Secret Controller works as expected
      INFO Cluster operator cluster-api SecretSyncControllerDegraded is False with AsExpected: User Data Secret Controller works as expected
      INFO Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.12.0-0.nightly-2022-10-25-210451, 0 replicas available
      ERROR Cluster operator console Available is False with Deployment_InsufficientReplicas::RouteHealth_FailedGet: DeploymentAvailable: 0 replicas available for console deployment
      ERROR RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com): Get "https://console-openshift-console.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host
      INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663', forbidden
      ERROR The kube-controller-manager logs may contain more details.)
      INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
      ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663', forbidden
      ERROR The kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
      INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
      INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
      INFO Cluster operator insights Disabled is False with AsExpected:
      INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
      INFO Cluster operator network ManagementStateDegraded is False with :
      ERROR Cluster initialization failed because one or more operators are not functioning properly.
      ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
      ERROR failed to initialize the cluster: Cluster operators authentication, console, ingress are not available
      $ 
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          57m     Unable to apply 4.12.0-0.nightly-2022-10-25-210451: some cluster operators are not available
      $ oc get nodes
      NAME                                                       STATUS   ROLES                  AGE   VERSION
      jiwei-1026a-sx4ph-master-0.c.openshift-qe.internal         Ready    control-plane,master   57m   v1.25.2+4bd0702
      jiwei-1026a-sx4ph-master-1.c.openshift-qe.internal         Ready    control-plane,master   57m   v1.25.2+4bd0702
      jiwei-1026a-sx4ph-master-2.c.openshift-qe.internal         Ready    control-plane,master   55m   v1.25.2+4bd0702
      jiwei-1026a-sx4ph-worker-a-9xhnn.c.openshift-qe.internal   Ready    worker                 44m   v1.25.2+4bd0702
      jiwei-1026a-sx4ph-worker-b-ctfw9.c.openshift-qe.internal   Ready    worker                 44m   v1.25.2+4bd0702
      $ oc get co | grep -v 'True        False         False'
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-10-25-210451   False       False         True       53m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      console                                    4.12.0-0.nightly-2022-10-25-210451   False       True          False      43m     DeploymentAvailable: 0 replicas available for console deployment...
      ingress                                                                         False       True          True       43m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663', forbidden...
      $ 
      $ oc get pods -n openshift-ingress-operator
      NAME                                READY   STATUS    RESTARTS      AGE
      ingress-operator-84d549fd76-nfr4l   2/2     Running   2 (47m ago)   57m
      $ oc logs ingress-operator-84d549fd76-nfr4l -n openshift-ingress-operator
      ......
      2022-10-26T13:53:31.279Z        ERROR   operator.ingress_controller     controller/controller.go:121    got retryable error; requeueing{"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a35d52ba3a1c44a2d9fc8449034eb663', forbidden\nThe kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
      2022-10-26T13:53:58.388Z        ERROR   operator.canary_controller      wait/wait.go:157        error performing canary route check    {"error": "error sending canary HTTP request: DNS error: Get \"https://canary-openshift-ingress-canary.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.jiwei-1026a.qe1.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host"}
      $ 
      $ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc AND NOT name~ci-op-xpn' 2> /dev/null
      NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                         DENY  DISABLED
      preserved-ipi-xpn-api               installer-shared-vpc  INGRESS    1000      tcp:6443,tcp:80,tcp:443                                                                                             False
      preserved-ipi-xpn-bastion-access    installer-shared-vpc  INGRESS    1000      tcp:22,tcp:3128-3129,tcp:5000,tcp:6001-6002,tcp:8080                                                                False
      preserved-ipi-xpn-control-plane     installer-shared-vpc  INGRESS    1000      tcp:22623,tcp:10257,tcp:10259                                                                                       False
      preserved-ipi-xpn-etcd              installer-shared-vpc  INGRESS    1000      tcp:2379-2380                                                                                                       False
      preserved-ipi-xpn-health-checks     installer-shared-vpc  INGRESS    1000      tcp:6080,tcp:6443,tcp:22624,tcp:30000-32767                                                                         False
      preserved-ipi-xpn-internal-cluster  installer-shared-vpc  INGRESS    1000      tcp:30000-32767,udp:30000-32767,tcp:9000-9999,udp:9000-9999,udp:4789,udp:6081,udp:500,udp:4500,tcp:10250,esp        False
      preserved-ipi-xpn-internal-network  installer-shared-vpc  INGRESS    1000      tcp:22,icmp                                                                                                         False
      $ 
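
      If the checks above show that the role is only bound in the service project, one possible mitigation to try, purely as a workaround rather than a fix for the underlying behavior with "createFirewallRules: Disabled", would be to also bind it on the host project for whichever service account the service-controller is using (e-mail hypothetical):

      $ gcloud projects add-iam-policy-binding openshift-qe-shared-vpc \
          --member='serviceAccount:<service-controller-sa>@openshift-qe.iam.gserviceaccount.com' \
          --role='roles/compute.networkUser'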
      

       

       

       
