Bug
Resolution: Done-Errata
Critical
4.14
No
False
Bug Fix
Done
After creating a 4.14 ARO cluster, some cluster operators are not available because the load balancer cannot be created.
This is caused by a change to the default value of vmType in cloud-provider-azure:
https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4214
ARO uses the standard vmType and does not use VMSS for any cluster node, but the installer does not specify vmType. This causes a vmType mismatch, and cloud-provider-azure cannot configure the load balancer.
We would like the installer to set vmType to `standard` by default, or to provide an option to set it, for example through the install config.
discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1700814868246649
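Whether a cluster is exposed to the mismatch described above can be checked from the cloud provider config. The following is a minimal sketch, assuming the JSON stored under `data.config` of the `cloud-provider-config` ConfigMap has already been read into a string. The helper name `vm_type_status` and the trimmed sample value are hypothetical, and the sketch only reports whether `vmType` is set explicitly; it does not assert which default cloud-provider-azure applies.

```python
import json


def vm_type_status(config_json: str) -> str:
    """Describe how vmType is set in a cloud provider config JSON document.

    Hypothetical helper for illustration; it is not part of the installer
    or of cloud-provider-azure.
    """
    config = json.loads(config_json)
    vm_type = config.get("vmType", "")
    if not vm_type:
        # An empty or missing value leaves the choice to the provider's default.
        return "unset (provider default applies)"
    return f"explicit: {vm_type}"


# Trimmed sample modelled on the ConfigMap in this report (assumed values).
sample = '{"cloud": "AzurePublicCloud", "vmType": "", "loadBalancerSku": "standard"}'
print(vm_type_status(sample))
```

An "unset" result on a cluster whose nodes are all standard VMs would match the situation reported here.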
Reproducible steps:
Create a 4.14 ARO cluster. Creating a regular cluster with standard VMs in Azure might also reproduce the issue.
What I got:
❯ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.1    False       True          True       21m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.atokubi.eastus.osadev.cloud/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
cloud-controller-manager                   4.14.1    True        False         False      24m
cloud-credential                           4.14.1    True        False         False      26m
cluster-autoscaler                         4.14.1    True        False         False      20m
config-operator                            4.14.1    True        False         False      21m
console                                    4.14.1    False       True          False      13m     DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set                  4.14.1    True        False         False      14m
csi-snapshot-controller                    4.14.1    True        False         False      20m
dns                                        4.14.1    True        False         False      20m
etcd                                       4.14.1    True        False         False      19m
image-registry                             4.14.1    True        False         False      8m11s
ingress                                              False       True          True       7m36s   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0...
insights                                   4.14.1    True        False         False      14m
kube-apiserver                             4.14.1    True        True          False      10m     NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
kube-controller-manager                    4.14.1    True        False         False      18m
kube-scheduler                             4.14.1    True        False         False      17m
kube-storage-version-migrator              4.14.1    True        False         False      21m
machine-api                                4.14.1    True        False         False      11m
machine-approver                           4.14.1    True        False         False      20m
machine-config                             4.14.1    True        False         False      15m
marketplace                                4.14.1    True        False         False      20m
monitoring                                 4.14.1    True        False         False      6m53s
network                                    4.14.1    True        False         False      22m
node-tuning                                4.14.1    True        False         False      20m
openshift-apiserver                        4.14.1    True        False         False      14m
openshift-controller-manager               4.14.1    True        False         False      20m
openshift-samples                          4.14.1    True        False         False      14m
operator-lifecycle-manager                 4.14.1    True        False         False      20m
operator-lifecycle-manager-catalog         4.14.1    True        False         False      20m
operator-lifecycle-manager-packageserver   4.14.1    True        False         False      14m
service-ca                                 4.14.1    True        False         False      21m
storage                                    4.14.1    True        False         False      20m
❯ oc get svc -A | grep LoadBalancer
openshift-ingress router-default LoadBalancer 172.30.43.24 <pending> 80:32538/TCP,443:31115/TCP 38m
❯ oc get cm cloud-provider-config -n openshift-config -oyaml
apiVersion: v1
data:
  config: '{"cloud":"AzurePublicCloud","tenantId":"<reducted>","aadClientId":"","aadClientSecret":"","aadClientCertPath":"","aadClientCertPassword":"","useManagedIdentityExtension":false,"userAssignedIdentityID":"","subscriptionId":"<reducted>","resourceGroup":"aro-atokubi","location":"eastus","vnetName":"dev-vnet","vnetResourceGroup":"v4-eastus","subnetName":"atokubi-worker","securityGroupName":"atokubi-vnkt5-nsg","routeTableName":"atokubi-vnkt5-node-routetable","primaryAvailabilitySetName":"","vmType":"","primaryScaleSetName":"","cloudProviderBackoff":true,"cloudProviderBackoffRetries":0,"cloudProviderBackoffExponent":0,"cloudProviderBackoffDuration":6,"cloudProviderBackoffJitter":0,"cloudProviderRateLimit":false,"cloudProviderRateLimitQPS":0,"cloudProviderRateLimitBucket":0,"cloudProviderRateLimitQPSWrite":0,"cloudProviderRateLimitBucketWrite":0,"useInstanceMetadata":true,"loadBalancerSku":"standard","excludeMasterFromStandardLB":false,"disableOutboundSNAT":true,"maximumLoadBalancerRuleCount":0}'
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-29T10:08:19Z"
  name: cloud-provider-config
  namespace: openshift-config
  resourceVersion: "33363"
  uid: 8b35cf3f-65ee-428d-92e6-304165301e96
❯ oc logs azure-cloud-controller-manager-fbdfbdb86-hk646 -n openshift-cloud-controller-manager
Defaulted container "cloud-controller-manager" out of: cloud-controller-manager, azure-inject-credentials (init)
<omitted>
I1129 10:46:47.401672 1 controller.go:388] Ensuring load balancer for service openshift-ingress/router-default
I1129 10:46:47.401732 1 azure_loadbalancer.go:122] reconcileService: Start reconciling Service "openshift-ingress/router-default" with its resource basename "ac376ce0f66164eebb9fc0fa76a9c697"
I1129 10:46:47.401742 1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(openshift-ingress/router-default) - wantLb(true): started
I1129 10:46:47.401849 1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1129 10:46:47.505374 1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-atokubi) success
I1129 10:46:47.573290 1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(openshift-ingress/router-default): lb(aro-atokubi/atokubi-vnkt5) wantLb(true) resolved load balancer name
I1129 10:46:47.643053 1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1129 10:46:47.716774 1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-atokubi/providers/Microsoft.Network/networkInterfaces/atokubi-vnkt5-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
E1129 10:46:47.716802 1 azure_loadbalancer.go:126] reconcileLoadBalancer(openshift-ingress/router-default) failed: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
I1129 10:46:47.716835 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.315082823 request="services_ensure_loadbalancer" resource_group="aro-atokubi" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="openshift-ingress/router-default" result_code="failed_ensure_loadbalancer"
E1129 10:46:47.716866 1 controller.go:291] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
I1129 10:46:47.716964 1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0"
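To pick the failing step out of a longer cloud-controller-manager log, the error-severity entries can be filtered by their klog header. This is a rough sketch, assuming the log text has been saved with one entry per line; the function name `error_entries` is hypothetical, and the two sample lines are copied from the log in this report.

```python
import re

# klog header: severity letter, mmdd, time, thread id, "file:line]", then the message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s\]]+)\] (?P<message>.*)$"
)


def error_entries(log_text: str) -> list[tuple[str, str]]:
    """Return (source, message) pairs for error-severity klog lines."""
    entries = []
    for line in log_text.splitlines():
        match = KLOG_LINE.match(line.strip())
        if match and match.group("severity") == "E":
            entries.append((match.group("source"), match.group("message")))
    return entries


sample_log = """\
I1129 10:46:47.401672 1 controller.go:388] Ensuring load balancer for service openshift-ingress/router-default
E1129 10:46:47.716802 1 azure_loadbalancer.go:126] reconcileLoadBalancer(openshift-ingress/router-default) failed: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
"""
for source, message in error_entries(sample_log):
    print(source, "->", message)
```

Lines that do not start with a klog header, such as the `Defaulted container` notice, are skipped.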
After changing vmType from empty to "standard" in cloud-provider-config, cloud-provider-azure can configure the load balancer and the errors are gone.
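The change to the config value can be sketched as a small transformation of the JSON string. This is an illustrative sketch only: the function name `with_vm_type` is hypothetical, the sample is a trimmed version of the config shown in this report, and the report does not state how the value was actually edited.

```python
import json


def with_vm_type(config_json: str, vm_type: str = "standard") -> str:
    """Return a copy of the cloud provider config JSON with vmType set."""
    config = json.loads(config_json)
    config["vmType"] = vm_type
    # Compact separators keep the single-line style of the original value.
    return json.dumps(config, separators=(",", ":"))


# Trimmed sample modelled on the ConfigMap in this report (assumed values).
before = '{"cloud":"AzurePublicCloud","vmType":"","loadBalancerSku":"standard"}'
after = with_vm_type(before)
print(after)
```

The resulting string would still have to be written back to the `config` key of the ConfigMap, for example with `oc edit cm cloud-provider-config -n openshift-config`.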
- blocks: OCPBUGS-24521 [4.14] Load balancers are not created in ARO (Closed)
- is cloned by: OCPBUGS-24521 [4.14] Load balancers are not created in ARO (Closed)
- is related to: OCPCLOUD-2409 Impact [4.14] LB not getting External-IP (Closed)
- relates to: OCPBUGS-25483 LB not getting External-IP (Closed)
- links to: RHSA-2023:7198 OpenShift Container Platform 4.15 security update