Type: Bug
Resolution: Done-Errata
Priority: Critical
Severity: Critical
Affects Version: 4.15.0
Blocker: No
Sprint: CLOUD Sprint 244, CLOUD Sprint 245
Release Note Type: Release Note Not Required
Description of problem:
Installing a private cluster on GCP, whether IPI or UPI, always fails: the ingress cluster operator reports LoadBalancerPending and CanaryChecksRepetitiveFailures.
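For anyone triaging a similar report, the degraded state can be confirmed directly on the affected cluster. This is a minimal sketch using standard oc commands; the expected observations come from the status messages quoted under "Additional info" below, not from a fresh reproduction:

$ # The ingress cluster operator reports Available=False and Degraded=True
$ oc get co ingress
$ # The default router's LoadBalancer service never gets an EXTERNAL-IP (stays <pending>)
$ oc -n openshift-ingress get svc router-default
$ # The IngressController status shows LoadBalancerReady=False and CanaryChecksSucceeding=False
$ oc -n openshift-ingress-operator describe ingresscontroller default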
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-07-233748
How reproducible:
Always
Steps to Reproduce:
1. Create a private cluster on GCP, either IPI or UPI (a representative install-config sketch follows).
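For reference, a private GCP install-config looks roughly like the sketch below. The project, region, cluster name, and subnet names are taken from the failing cluster visible in the logs under "Additional info"; baseDomain is inferred from the route hostname there, and the VPC name and pull secret are placeholders:

$ cat install-config.yaml
apiVersion: v1
baseDomain: qe.gcp.devcluster.openshift.com
metadata:
  name: jiwei-1108-priv
platform:
  gcp:
    projectID: openshift-qe
    region: us-central1
    network: <existing-vpc-name>                      # placeholder; private installs target a pre-existing VPC
    controlPlaneSubnet: jiwei-1108-priv-master-subnet
    computeSubnet: jiwei-1108-priv-worker-subnet
publish: Internal                                     # this is what makes the cluster private
pullSecret: '<redacted>'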
Actual results:
The installation fails, with the ingress cluster operator degraded.
Expected results:
The installation succeeds.
Additional info:
Some PROW CI tests:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920
(Must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920/artifacts/gcp-ipi-private-f28-longduration-cloud/gather-must-gather/artifacts/must-gather.tar)
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-xpn-private-f28/1722176483704705024
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-private-fips-f6-disasterrecovery/1722066338567950336

FYI QE Flexy-install jobs: IPI Flexy-install/245364/, UPI Flexy-install/245524/

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          14h     Unable to apply 4.15.0-0.nightly-2023-11-07-233748: some cluster operators are not available

$ oc get nodes
NAME                                                           STATUS   ROLES                  AGE   VERSION
jiwei-1108-priv-kx7b4-master-0.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-master-2.c.openshift-qe.internal         Ready    control-plane,master   14h   v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-worker-a-l28pl.c.openshift-qe.internal   Ready    worker                 14h   v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-worker-b-84bx5.c.openshift-qe.internal   Ready    worker                 14h   v1.28.3+4cbdd29

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.nightly-2023-11-07-233748   False       False         True       14h     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal                                  4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
cloud-controller-manager                   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
cloud-credential                           4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
cluster-autoscaler                         4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
config-operator                            4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
console                                    4.15.0-0.nightly-2023-11-07-233748   False       True          False      14h     DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set                  4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
csi-snapshot-controller                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
dns                                        4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
etcd                                       4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
image-registry                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
ingress                                                                         False       True          True       7h37m   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights                                   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
kube-apiserver                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
kube-controller-manager                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
kube-scheduler                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
kube-storage-version-migrator              4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
machine-api                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
machine-approver                           4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
machine-config                             4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
marketplace                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
monitoring                                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
network                                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
node-tuning                                4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
openshift-apiserver                        4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
openshift-controller-manager               4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
openshift-samples                          4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
operator-lifecycle-manager                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
service-ca                                 4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h
storage                                    4.15.0-0.nightly-2023-11-07-233748   True        False         False      14h

$ oc describe co ingress
Name:         ingress
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-11-08T10:38:15Z
  Generation:          1
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             dbaae892-1b6d-480d-a201-0549d0a3149d
  Resource Version:  172514
  UID:               3922a9fe-584f-458f-ac4f-b62b4842758e
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-11-08T17:49:01Z
    Message:               The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
    Reason:                IngressUnavailable
    Status:                False
    Type:                  Available
    Last Transition Time:  2023-11-08T11:02:27Z
    Message:               Not all ingress controllers are available.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-11-08T17:51:01Z
    Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
    Reason:                IngressDegraded
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-11-08T10:52:36Z
    Reason:                IngressControllersUpgradeable
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2023-11-08T10:52:36Z
    Reason:                AsExpected
    Status:                False
    Type:                  EvaluationConditionsDetected
  Extension:  <nil>
  Related Objects:
    Group:
    Name:       openshift-ingress-operator
    Resource:   namespaces
    Group:      operator.openshift.io
    Name:
    Namespace:  openshift-ingress-operator
    Resource:   ingresscontrollers
    Group:      ingress.operator.openshift.io
    Name:
    Namespace:  openshift-ingress-operator
    Resource:   dnsrecords
    Group:
    Name:       openshift-ingress
    Resource:   namespaces
    Group:
    Name:       openshift-ingress-canary
    Resource:   namespaces
Events:  <none>

$ oc get pods -n openshift-ingress-operator -o wide
NAME                                READY   STATUS    RESTARTS      AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
ingress-operator-57c555c75b-gqbk6   2/2     Running   2 (14h ago)   14h   10.129.0.36   jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal   <none>           <none>

$ oc -n openshift-ingress-operator logs ingress-operator-57c555c75b-gqbk6
...output omitted...
2023-11-08T10:56:53.715Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod \"router-default-7c86c4f4b5-jsljz\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Pod \"router-default-7c86c4f4b5-pltz4\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: INSTANCE_IN_MULTIPLE_LOAD_BALANCED_IGS - Validation failed for instance 'projects/openshift-qe/zones/us-central1-a/instances/jiwei-1108-priv-kx7b4-master-0': instance may belong to at most one load-balanced instance group.\nThe kube-controller-manager logs may contain more details.)"}
...output omitted...
2023-11-08T15:13:41.323Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1108-priv-kx7b4-worker-b-84bx5' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-worker-subnet'., wrongSubnetwork\nThe kube-controller-manager logs may contain more details.)"}
...output omitted...
$

Must-gather: https://drive.google.com/file/d/1zwhJ4ga0-tQuRorha4XnUGUKbSTx1fx4/view?usp=drive_link
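The two googleapi 400 errors above both concern the GCP instance-group membership of the cluster nodes: master-0 is rejected because it already belongs to another load-balanced instance group, and worker-b-84bx5 is rejected because the load balancer's backend group is pinned to the master subnet. A hedged sketch of how that state could be inspected with gcloud; project, zones, and instance names are taken from the errors above, while the group name is a placeholder since the generated names vary:

$ # List instance groups associated with this cluster (the cloud provider's
$ # service-controller groups are typically prefixed k8s-ig--)
$ gcloud compute instance-groups list --project openshift-qe --filter="name~jiwei-1108-priv OR name~k8s-ig"
$ # Check which groups contain master-0; INSTANCE_IN_MULTIPLE_LOAD_BALANCED_IGS means
$ # it would end up as a backend of more than one load-balanced group
$ gcloud compute instance-groups list-instances <group-name> --zone us-central1-a --project openshift-qe
$ # Confirm which subnetwork worker-b-84bx5 actually lives in (the wrongSubnetwork error
$ # says the LB expected it in jiwei-1108-priv-master-subnet)
$ gcloud compute instances describe jiwei-1108-priv-kx7b4-worker-b-84bx5 --zone us-central1-b --project openshift-qe --format="value(networkInterfaces[0].subnetwork)"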