    • Release Note Text:
      * Previously, the default VM type for the {azure-short} load balancer was changed from `Standard` to `VMSS`, but the service type load balancer code could not attach standard VMs to load balancers.
      With this release, the default VM type is reverted to remain compatible with {product-title} deployments.
      (link:https://issues.redhat.com/browse/OCPBUGS-25483[*OCPBUGS-25483*])
    • Release Note Type: Bug Fix
    • Resolution: Done

      Description of problem:

      A regression was identified when creating LoadBalancer services in ARO on new 4.14 clusters (handled for new installations in OCPBUGS-24191).

      The same regression has also been confirmed in ARO clusters upgraded to 4.14.

      Version-Release number of selected component (if applicable):

      4.14.z

      How reproducible:

      On any ARO cluster upgraded to 4.14.z    

      Steps to Reproduce:

          1. Install an ARO cluster
          2. Upgrade to 4.14 from the fast channel
          3. oc create svc loadbalancer test-lb -n default --tcp 80:8080

      Actual results:

      # External-IP stuck in Pending
      $ oc get svc test-lb -n default
      NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
      test-lb   LoadBalancer   172.30.104.200   <pending>     80:30062/TCP   15m
      
      
      # Errors in cloud-controller-manager being unable to map VM to nodes
      $ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure  -n openshift-cloud-controller-manager
      I1215 19:34:51.843715       1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started
      I1215 19:34:51.844474       1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
      I1215 19:34:52.253569       1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success
      I1215 19:34:52.253632       1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name
      I1215 19:34:52.528579       1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
      E1215 19:34:52.714678       1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
      E1215 19:34:52.714888       1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
      I1215 19:34:52.714956       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer"
      E1215 19:34:52.715005       1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0

      Expected results:

      # The LoadBalancer gets an External-IP assigned
      $ oc get svc test-lb -n default 
      NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE 
      test-lb      LoadBalancer   172.30.193.159   20.242.180.199                         80:31475/TCP   14s

      Additional info:

      In the cloud-provider-config ConfigMap in the openshift-config namespace, vmType="".
      
      When vmType is explicitly changed to "standard", the provisioning of the LoadBalancer completes and an external IP gets assigned without errors (see the sketch below).
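
      A minimal sketch of checking and overriding this value, assuming the Azure provider configuration is the JSON document stored under the "config" key of that ConfigMap (the key name and whether the ConfigMap should be edited directly are assumptions; verify on the target cluster first):

      # Inspect the current vmType (hypothetical workaround sketch, not a documented procedure)
      $ oc get configmap cloud-provider-config -n openshift-config \
          -o jsonpath='{.data.config}' | jq -r '.vmType'

      # Rewrite the config with vmType set to "standard" and replace the ConfigMap
      $ oc get configmap cloud-provider-config -n openshift-config \
          -o jsonpath='{.data.config}' | jq '.vmType = "standard"' > /tmp/azure-config.json
      $ oc create configmap cloud-provider-config -n openshift-config \
          --from-file=config=/tmp/azure-config.json --dry-run=client -o yaml | oc replace -f -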

            [OCPBUGS-25483] LB not getting External-IP

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:0041

            Zhaohua Sun added a comment -

            Verified on clusterversion 4.16.0-0.nightly-2024-01-05-154400.

            Set up a 4.13 cluster in ARO and upgraded it to 4.14 to reproduce this issue.

            $ oc get svc -n default                                                                      
            NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
            kubernetes   ClusterIP      172.30.0.1       <none>                                 443/TCP        6h
            openshift    ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         5h55m
            test-lb      LoadBalancer   172.30.237.164   <pending>                              80:31024/TCP   2m20s
            $ oc get clusterversion                                                                               
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.14.8    True        False         49m     Cluster version is 4.14.8 

            Upgraded the cluster to 4.15 and then to 4.16; the LB can now get an external IP.

            $ oc get svc -n default                                                                               [22:59:53]
            NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
            kubernetes   ClusterIP      172.30.0.1       <none>                                 443/TCP        11h
            openshift    ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         10h
            test-lb      LoadBalancer   172.30.237.164   20.80.37.78                            80:31024/TCP   5h1m
            test-lb1     LoadBalancer   172.30.231.162   20.241.93.188                          80:30521/TCP   17s 


            OpenShift Jira Bot added a comment -

            Hi rh-ee-tbarberb,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            W. Trevor King added a comment -

            I've elevated Priority to Major and set Target Backport Versions: 4.14, because we're declaring an AzureDefaultVMType update risk to warn exposed clusters about this issue for 4.13-to-4.14 updates, and we don't like declared update risks to remain unfixed for too long.

            Ayato Tokubi added a comment -

            > do you know whether setting the VM type to standard works with the new naming format as well?

            It works for both the old and the new naming format; at least, I confirmed it for ARO.

            I looked at the CCCMO PR and I think it would work.

            Joel Speed added a comment -

            rh-ee-atokubi Since you've already been investigating this issue, do you know whether setting the VM type to standard works with the new naming format as well?

            I'm wondering if it's possible to just force the type to standard for all OpenShift clusters, which will then be applicable to any generation.

            Note, I think this is similar to OCPBUGS-20213, which we already fixed (and this bug would be fixed in a very similar way); however, at the time we only understood the issue to exist on Azure Stack.


            Ayato Tokubi added a comment (edited) -

            TL;DR

            After investigation, I found that this is caused by a discrepancy between the VM name and the NIC name.
            All ARO clusters, and non-ARO Azure clusters created before 4.9, have this discrepancy and will be affected by the issue.
            https://github.com/openshift/installer/pull/5082/files

            cloud-provider-azure detects the vmManagementType and changes its load balancer behaviour accordingly.
            https://github.com/openshift/cloud-provider-azure/blob/release-4.14/pkg/provider/azure_vmss.go#L1593-L1606

            vmManagementType is determined by finding the vmName in a cached vm list.
            vmName is retrieved from ipConfigurationID by regex.

            For my test cluster, ipConfigurationID was 

            /subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-atokubi/providers/Microsoft.Network/networkInterfaces/atokubi-kxb2f-master0-nic/ipConfigurations/pipConfig 

            thus the vmName was "atokubi-kxb2f-master0".
            https://github.com/openshift/cloud-provider-azure/blob/release-4.14/pkg/provider/azure_vmss_cache.go#L487-L497
            https://github.com/openshift/cloud-provider-azure/blob/release-4.14/pkg/provider/azure_standard.go#L52

             

            However, the cached VM list is retrieved by calling the Azure API directly.

            For my test cluster, the corresponding element of the list was "atokubi-kxb2f-master-0", which has a "-" between "master" and the number.
            https://github.com/openshift/cloud-provider-azure/blob/release-4.14/pkg/provider/azure_vmss_cache.go#L341

            Because of this discrepancy ("atokubi-kxb2f-master-0" vs. "atokubi-kxb2f-master0"), cloud-provider-azure couldn't find the vmName in the cached list and regarded the VM type as vmssflex (see the sketch below).
            https://github.com/openshift/cloud-provider-azure/blob/release-4.14/pkg/provider/azure_vmss_cache.go#L504
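
            To make the mismatch concrete, here is a minimal shell sketch (a hypothetical derivation, not the actual regex in azure_standard.go) that takes the candidate VM name from the NIC segment of the ipConfigurationID and compares it with the name returned by the Azure VM list API:

            # Candidate VM name derived from the NIC name (strip the trailing "-nic")
            ipConfigurationID="/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-atokubi/providers/Microsoft.Network/networkInterfaces/atokubi-kxb2f-master0-nic/ipConfigurations/pipConfig"
            nic_name=$(basename "$(dirname "$(dirname "$ipConfigurationID")")")   # atokubi-kxb2f-master0-nic
            candidate_vm_name=${nic_name%-nic}                                    # atokubi-kxb2f-master0

            # Name reported by the Azure VM list API
            api_vm_name="atokubi-kxb2f-master-0"

            # The lookup fails, so the node is treated as vmssflex
            [ "$candidate_vm_name" = "$api_vm_name" ] || echo "no match: $candidate_vm_name vs $api_vm_name"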

             

            This discrepancy is not fixed in ARO; in the OpenShift installer it was fixed in 4.9.
            https://github.com/openshift/installer/pull/5082/files

            ❯ git branch -r --contains 9268ffea7292b69aa6c23df4078df1f6854a7372
              upstream/agent-installer
              upstream/azure-etcd-testing
              upstream/capi
              upstream/master
              upstream/ocpbugs-2144
              upstream/release-4.10
              upstream/release-4.11
              upstream/release-4.12
              upstream/release-4.13
              upstream/release-4.14
              upstream/release-4.14-azure-etcd
              upstream/release-4.15
              upstream/release-4.16
              upstream/release-4.9 

            Thus non-ARO Azure clusters created before 4.9 might still have master VMs with "master{num}-nic" NICs.
            Those clusters will be affected by this issue when they are upgraded, and won't be able to assign an external IP to a load balancer service (a quick exposure check is sketched below).
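
            One hypothetical way to check exposure, assuming the Azure CLI and access to the cluster resource group (the resource group name below is taken from the logs above as an example), is to look for master NICs that still use the old naming:

            # NICs named like "...master0-nic" indicate the pre-4.9 naming and potential exposure
            $ az network nic list -g aro-r5iks3dh --query "[].name" -o tsv | grep -E 'master[0-9]+-nic$'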


            W. Trevor King added a comment -

            This bug has an UpgradeBlocker label, so I've opened OCPCLOUD-2409 requesting an impact statement to explain exposure. Assigned to Joel, since this bug is assigned to Joel.

              People: Theo Barber-Bany (rh-ee-tbarberb), Miguel Abad Perez (mabadper@redhat.com), Zhaohua Sun, Jeana Routh