  1. OpenShift Bugs
  2. OCPBUGS-23094

[gcp] IPI or UPI private cluster on GCP failed due to ingress LB stuck in Pending


Details

    • Critical
    • No
    • CLOUD Sprint 244, CLOUD Sprint 245
    • 2
    • Approved
    • False
    • N/A
    • Release Note Not Required

    Description

      Description of problem:

      IPI or UPI installation of a private cluster on GCP always fails, with the ingress cluster operator reporting LoadBalancerPending and CanaryChecksRepetitiveFailures.

      Version-Release number of selected component (if applicable):

      4.15.0-0.nightly-2023-11-07-233748

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a private cluster on GCP, using either IPI or UPI.

      Actual results:

      The installation fails, with the ingress operator degraded.

      Expected results:

      The installation succeeds.

      Additional info:

      Some Prow CI test runs:
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920 (Must-gather https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920/artifacts/gcp-ipi-private-f28-longduration-cloud/gather-must-gather/artifacts/must-gather.tar)
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-xpn-private-f28/1722176483704705024
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-private-fips-f6-disasterrecovery/1722066338567950336
      
      
      FYI QE Flexy-install jobs: IPI Flexy-install/245364/, UPI Flexy-install/245524/
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          14h     Unable to apply 4.15.0-0.nightly-2023-11-07-233748: some cluster operators are not available
      $ oc get nodes
      NAME                                                           STATUS   ROLES                  AGE   VERSION
      jiwei-1108-priv-kx7b4-master-0.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
      jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
      jiwei-1108-priv-kx7b4-master-2.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
      jiwei-1108-priv-kx7b4-worker-a-l28pl.c.openshift-qe.internal   Ready    worker                 14h   v1.28.3+4cbdd29
      jiwei-1108-priv-kx7b4-worker-b-84bx5.c.openshift-qe.internal   Ready    worker                 14h   v1.28.3+4cbdd29
      $ oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2023-11-07-233748   False       False         True       14h     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
      baremetal                                  4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      cloud-controller-manager                   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      cloud-credential                           4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      cluster-autoscaler                         4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      config-operator                            4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      console                                    4.15.0-0.nightly-2023-11-07-233748   False       True          False      14h     DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      csi-snapshot-controller                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      dns                                        4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      etcd                                       4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      image-registry                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      ingress                                                                         False       True          True       7h37m   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
      insights                                   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      kube-apiserver                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      kube-controller-manager                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      kube-scheduler                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      kube-storage-version-migrator              4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      machine-api                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      machine-approver                           4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      machine-config                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      marketplace                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      monitoring                                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      network                                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      node-tuning                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      openshift-apiserver                        4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      openshift-controller-manager               4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      openshift-samples                          4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      operator-lifecycle-manager                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      service-ca                                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      storage                                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h     
      $ oc describe co ingress
      Name:         ingress
      Namespace:    
      Labels:       <none>
      Annotations:  include.release.openshift.io/ibm-cloud-managed: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
      API Version:  config.openshift.io/v1
      Kind:         ClusterOperator
      Metadata:
        Creation Timestamp:  2023-11-08T10:38:15Z
        Generation:          1
        Owner References:
          API Version:     config.openshift.io/v1
          Controller:      true
          Kind:            ClusterVersion
          Name:            version
          UID:             dbaae892-1b6d-480d-a201-0549d0a3149d
        Resource Version:  172514
        UID:               3922a9fe-584f-458f-ac4f-b62b4842758e
      Spec:
      Status:
        Conditions:
          Last Transition Time:  2023-11-08T17:49:01Z
          Message:               The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
          Reason:                IngressUnavailable
          Status:                False
          Type:                  Available
          Last Transition Time:  2023-11-08T11:02:27Z
          Message:               Not all ingress controllers are available.
          Reason:                Reconciling
          Status:                True
          Type:                  Progressing
          Last Transition Time:  2023-11-08T17:51:01Z
          Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
          Reason:                IngressDegraded
          Status:                True
          Type:                  Degraded
          Last Transition Time:  2023-11-08T10:52:36Z
          Reason:                IngressControllersUpgradeable
          Status:                True
          Type:                  Upgradeable
          Last Transition Time:  2023-11-08T10:52:36Z
          Reason:                AsExpected
          Status:                False
          Type:                  EvaluationConditionsDetected
        Extension:               <nil>
        Related Objects:
          Group:      
          Name:       openshift-ingress-operator
          Resource:   namespaces
          Group:      operator.openshift.io
          Name:       
          Namespace:  openshift-ingress-operator
          Resource:   ingresscontrollers
          Group:      ingress.operator.openshift.io
          Name:       
          Namespace:  openshift-ingress-operator
          Resource:   dnsrecords
          Group:      
          Name:       openshift-ingress
          Resource:   namespaces
          Group:      
          Name:       openshift-ingress-canary
          Resource:   namespaces
      Events:         <none>
      $ oc get pods -n openshift-ingress-operator -o wide
      NAME                                READY   STATUS    RESTARTS      AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
      ingress-operator-57c555c75b-gqbk6   2/2     Running   2 (14h ago)   14h   10.129.0.36   jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal   <none>           <none>
      $ oc -n openshift-ingress-operator logs ingress-operator-57c555c75b-gqbk6
      ...output omitted...
      2023-11-08T10:56:53.715Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod \"router-default-7c86c4f4b5-jsljz\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Pod \"router-default-7c86c4f4b5-pltz4\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: INSTANCE_IN_MULTIPLE_LOAD_BALANCED_IGS - Validation failed for instance 'projects/openshift-qe/zones/us-central1-a/instances/jiwei-1108-priv-kx7b4-master-0': instance may belong to at most one load-balanced instance group.\nThe kube-controller-manager logs may contain more details.)"}
      ...output omitted...
      2023-11-08T15:13:41.323Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1108-priv-kx7b4-worker-b-84bx5' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-worker-subnet'., wrongSubnetwork\nThe kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
      ...output omitted...
      $ 
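The two ingress-operator log lines above carry the underlying GCP API failures (`INSTANCE_IN_MULTIPLE_LOAD_BALANCED_IGS` and `wrongSubnetwork`) inside a long JSON payload. As a minimal triage sketch, assuming the log format shown above (a free-text prefix followed by one JSON object with an `error` key), the following Python pulls out just the `googleapi` errors. The names `extract_gcp_errors` and `SAMPLE_LINE` are illustrative and not part of any OpenShift tooling; the sample is the 15:13:41 log line above with the trailing canary condition trimmed.

```python
import json
import re

# Matches "googleapi: Error <HTTP status>: <detail up to end of line>".
GOOGLEAPI_ERROR = re.compile(r"googleapi: Error (\d+): ([^\n]*)")

# Trimmed copy of the 15:13:41 log line from this report (illustrative input).
SAMPLE_LINE = r"""2023-11-08T15:13:41.323Z    ERROR    operator.ingress_controller    controller/controller.go:118    got retryable error; requeueing    {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1108-priv-kx7b4-worker-b-84bx5' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-worker-subnet'., wrongSubnetwork\nThe kube-controller-manager logs may contain more details.)"}"""


def extract_gcp_errors(log_line):
    """Return (http_status, detail) pairs found in one ingress-operator log line.

    Assumes the structured part of the line is a single JSON object that
    starts at the first '{'. Lines without such a payload yield [].
    """
    start = log_line.find("{")
    if start == -1:
        return []
    try:
        payload = json.loads(log_line[start:])
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    error_text = payload.get("error", "")
    return [
        (int(status), detail.strip())
        for status, detail in GOOGLEAPI_ERROR.findall(error_text)
    ]


for status, detail in extract_gcp_errors(SAMPLE_LINE):
    print(status, detail)
```

Running the function over every line of the operator log and de-duplicating the results is one way to see whether the load balancer sync is failing for a single reason or for several, as it does in this report.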
      
      Must-gather https://drive.google.com/file/d/1zwhJ4ga0-tQuRorha4XnUGUKbSTx1fx4/view?usp=drive_link
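When triaging from a live cluster rather than the must-gather, the operator table in this report can be reduced to only the failing operators. The sketch below assumes the input is the parsed output of `oc get clusteroperators -o json`, that is, a list object whose `items` each have `metadata.name` and `status.conditions`. The function name `unhealthy_operators` and the `SAMPLE` data are illustrative; the ingress conditions are copied from the `oc describe co ingress` output in this report, and the dns entry omits reasons because the report does not show them.

```python
# Condition values that indicate a healthy ClusterOperator.
EXPECTED = {"Available": "True", "Degraded": "False"}

# Hand-built stand-in for `oc get clusteroperators -o json` (illustrative).
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "dns"},
            "status": {
                "conditions": [
                    {"type": "Available", "status": "True"},
                    {"type": "Degraded", "status": "False"},
                ]
            },
        },
        {
            "metadata": {"name": "ingress"},
            "status": {
                "conditions": [
                    {"type": "Available", "status": "False", "reason": "IngressUnavailable"},
                    {"type": "Progressing", "status": "True", "reason": "Reconciling"},
                    {"type": "Degraded", "status": "True", "reason": "IngressDegraded"},
                    {"type": "Upgradeable", "status": "True", "reason": "IngressControllersUpgradeable"},
                ]
            },
        },
    ]
}


def unhealthy_operators(co_list):
    """Map operator name to its Available/Degraded conditions that are off-nominal."""
    result = {}
    for item in co_list.get("items", []):
        name = item["metadata"]["name"]
        problems = []
        for cond in item.get("status", {}).get("conditions", []):
            want = EXPECTED.get(cond.get("type"))
            if want is not None and cond.get("status") != want:
                reason = cond.get("reason", "")
                problems.append(f'{cond["type"]}={cond["status"]} ({reason})')
        if problems:
            result[name] = problems
    return result


for name, problems in unhealthy_operators(SAMPLE).items():
    print(name, "; ".join(problems))
```

To use it against a real cluster, save the JSON with `oc get clusteroperators -o json > co.json` and pass the result of `json.load` on that file to `unhealthy_operators`.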

          People

            rh-ee-nbrubake Nolan Brubaker
            rhn-support-jiwei Jianli Wei
            Jianli Wei Jianli Wei