OCPBUGS-24521

[4.14] Load balancers are not created in ARO


      This is a clone of issue OCPBUGS-24191. The following is the description of the original issue:

      After creating a 4.14 ARO cluster, some cluster operators are unavailable because the load balancer cannot be created.

      The cause is a change to the default value of vmType in cloud-provider-azure:

      https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4214

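      ARO uses the standard vmType and does not use any VMSS instances as cluster nodes, but the installer does not set vmType in the cloud provider config. The empty value now resolves to a different default, so the configured vmType no longer matches the actual nodes and cloud-provider-azure cannot configure the load balancer.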

      https://github.com/openshift/installer/blob/release-4.14/pkg/asset/manifests/azure/cloudproviderconfig.go
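Whether a cluster is exposed to this mismatch can be checked from the `config` value of the `cloud-provider-config` ConfigMap. The following is a minimal sketch, assuming that value has already been extracted as a JSON string; the function name `vm_type_status` is hypothetical, and the sample string is a trimmed-down stand-in for the real config. The report does not state what the new built-in default is, so the sketch only reports whether vmType is unset.

```python
import json


def vm_type_status(config_json: str) -> str:
    """Classify the vmType setting in a cloud-provider config JSON string.

    Returns "unset" when vmType is missing or empty (the case where
    cloud-provider-azure falls back to its own default), otherwise the
    configured value in lower case.
    """
    config = json.loads(config_json)
    vm_type = (config.get("vmType") or "").strip().lower()
    if not vm_type:
        return "unset"
    return vm_type


# Trimmed-down sample modelled on the ConfigMap in this report (assumption:
# only the keys needed for the check are kept).
sample = '{"cloud":"AzurePublicCloud","vmType":"","loadBalancerSku":"standard"}'
print(vm_type_status(sample))
```

A result of `unset` on a cluster whose nodes are all standard VMs would match the situation described here.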

      We would like the installer to default vmType to `standard`, or to provide an option to set it via the install config or a similar mechanism.

      discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1700814868246649

       

      Steps to reproduce:

      Create a 4.14 ARO cluster.
      Creating a regular OpenShift cluster on Azure with standard VMs might also reproduce the issue.
      

      What I got:

      ❯ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.1    False       True          True       21m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.atokubi.eastus.osadev.cloud/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      cloud-controller-manager                   4.14.1    True        False         False      24m
      cloud-credential                           4.14.1    True        False         False      26m
      cluster-autoscaler                         4.14.1    True        False         False      20m
      config-operator                            4.14.1    True        False         False      21m
      console                                    4.14.1    False       True          False      13m     DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.14.1    True        False         False      14m
      csi-snapshot-controller                    4.14.1    True        False         False      20m
      dns                                        4.14.1    True        False         False      20m
      etcd                                       4.14.1    True        False         False      19m
      image-registry                             4.14.1    True        False         False      8m11s
      ingress                                              False       True          True       7m36s   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0...
      insights                                   4.14.1    True        False         False      14m
      kube-apiserver                             4.14.1    True        True          False      10m     NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
      kube-controller-manager                    4.14.1    True        False         False      18m
      kube-scheduler                             4.14.1    True        False         False      17m
      kube-storage-version-migrator              4.14.1    True        False         False      21m
      machine-api                                4.14.1    True        False         False      11m
      machine-approver                           4.14.1    True        False         False      20m
      machine-config                             4.14.1    True        False         False      15m
      marketplace                                4.14.1    True        False         False      20m
      monitoring                                 4.14.1    True        False         False      6m53s
      network                                    4.14.1    True        False         False      22m
      node-tuning                                4.14.1    True        False         False      20m
      openshift-apiserver                        4.14.1    True        False         False      14m
      openshift-controller-manager               4.14.1    True        False         False      20m
      openshift-samples                          4.14.1    True        False         False      14m
      operator-lifecycle-manager                 4.14.1    True        False         False      20m
      operator-lifecycle-manager-catalog         4.14.1    True        False         False      20m
      operator-lifecycle-manager-packageserver   4.14.1    True        False         False      14m
      service-ca                                 4.14.1    True        False         False      21m
      storage                                    4.14.1    True        False         False      20m 
      ❯ oc get svc -A | grep LoadBalancer
      openshift-ingress                                  router-default                             LoadBalancer   172.30.43.24     <pending>                              80:32538/TCP,443:31115/TCP                38m
      
      ❯ oc get cm cloud-provider-config -n openshift-config -oyaml
      apiVersion: v1
      data:
        config: '{"cloud":"AzurePublicCloud","tenantId":"<redacted>","aadClientId":"","aadClientSecret":"","aadClientCertPath":"","aadClientCertPassword":"","useManagedIdentityExtension":false,"userAssignedIdentityID":"","subscriptionId":"<redacted>","resourceGroup":"aro-atokubi","location":"eastus","vnetName":"dev-vnet","vnetResourceGroup":"v4-eastus","subnetName":"atokubi-worker","securityGroupName":"atokubi-vnkt5-nsg","routeTableName":"atokubi-vnkt5-node-routetable","primaryAvailabilitySetName":"","vmType":"","primaryScaleSetName":"","cloudProviderBackoff":true,"cloudProviderBackoffRetries":0,"cloudProviderBackoffExponent":0,"cloudProviderBackoffDuration":6,"cloudProviderBackoffJitter":0,"cloudProviderRateLimit":false,"cloudProviderRateLimitQPS":0,"cloudProviderRateLimitBucket":0,"cloudProviderRateLimitQPSWrite":0,"cloudProviderRateLimitBucketWrite":0,"useInstanceMetadata":true,"loadBalancerSku":"standard","excludeMasterFromStandardLB":false,"disableOutboundSNAT":true,"maximumLoadBalancerRuleCount":0}'
      kind: ConfigMap
      metadata:
        creationTimestamp: "2023-11-29T10:08:19Z"
        name: cloud-provider-config
        namespace: openshift-config
        resourceVersion: "33363"
        uid: 8b35cf3f-65ee-428d-92e6-304165301e96 
      ❯ oc logs azure-cloud-controller-manager-fbdfbdb86-hk646 -n openshift-cloud-controller-manager
      Defaulted container "cloud-controller-manager" out of: cloud-controller-manager, azure-inject-credentials (init)
      <omitted>
      I1129 10:46:47.401672       1 controller.go:388] Ensuring load balancer for service openshift-ingress/router-default
      I1129 10:46:47.401732       1 azure_loadbalancer.go:122] reconcileService: Start reconciling Service "openshift-ingress/router-default" with its resource basename "ac376ce0f66164eebb9fc0fa76a9c697"
      I1129 10:46:47.401742       1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(openshift-ingress/router-default) - wantLb(true): started
      I1129 10:46:47.401849       1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
      I1129 10:46:47.505374       1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-atokubi) success
      I1129 10:46:47.573290       1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(openshift-ingress/router-default): lb(aro-atokubi/atokubi-vnkt5) wantLb(true) resolved load balancer name
      I1129 10:46:47.643053       1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
      E1129 10:46:47.716774       1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-atokubi/providers/Microsoft.Network/networkInterfaces/atokubi-vnkt5-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
      E1129 10:46:47.716802       1 azure_loadbalancer.go:126] reconcileLoadBalancer(openshift-ingress/router-default) failed: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
      I1129 10:46:47.716835       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.315082823 request="services_ensure_loadbalancer" resource_group="aro-atokubi" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="openshift-ingress/router-default" result_code="failed_ensure_loadbalancer"
      E1129 10:46:47.716866       1 controller.go:291] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
      I1129 10:46:47.716964       1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0"
      

       

      After changing vmType from empty to "standard" in cloud-provider-config, cloud-provider-azure can configure the load balancer and the errors are gone.
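The workaround amounts to filling in vmType only when it is empty, which is also the defaulting behaviour requested from the installer. A minimal sketch of that transformation on the config JSON string follows; the helper name `set_vm_type` is hypothetical, and how the edited value is written back to the ConfigMap is deliberately left out because the report does not describe it.

```python
import json


def set_vm_type(config_json: str, vm_type: str = "standard") -> str:
    """Return the config JSON with vmType filled in when it is empty.

    An explicitly configured vmType is left unchanged, and all other
    keys keep their values and order.
    """
    config = json.loads(config_json)
    if not config.get("vmType"):
        config["vmType"] = vm_type
    # Compact separators mirror the single-line style of the original value.
    return json.dumps(config, separators=(",", ":"))


# Trimmed-down sample modelled on the ConfigMap in this report (assumption:
# only a few keys are kept for brevity).
before = '{"cloud":"AzurePublicCloud","vmType":"","loadBalancerSku":"standard"}'
after = set_vm_type(before)
print(after)
```

Leaving an existing non-empty vmType untouched keeps the sketch safe for clusters that intentionally use another VM type.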

       

            Unassigned
            OpenShift Prow Bot (openshift-crt-jira-prow)
            Mike Gahagan