OCPBUGS-16336

Missing BP of LoadBalancerServiceHasNodePortAllocation into 4.12 causes problems with flow creation for these services

      ==== Support for allocateLoadBalancerNodePorts in Service object of Network API
      The `ServiceSpec` component in the Network API under the `Service` object describes the attributes that a user creates on a service. The `allocateLoadBalancerNodePorts` attribute within `ServiceSpec` is supported from {product-title} 4.12.28 onwards. It defines whether `NodePorts` are automatically allocated for services of type `LoadBalancer`.

      TL;DR

      4.12 requires backport of commit:

      commit 0111e1faec20d16505a110449966273b430b7ad1
      Author: Surya Seetharaman <suryaseetharaman.9@gmail.com>
      Date:   Tue Sep 6 21:20:57 2022 +0200
      
          Support AllocateLoadBalancerNodePortsFalse
          
          This PR supports having allocateloadbalancernodeports
          set to false along with etp=local on lgw mode.
          
          Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
      

      Analysis

      The missing backport of LoadBalancerServiceHasNodePortAllocation into 4.12 causes problems with flow creation for these services, even in shared gateway mode.

      This issue affects services with `allocateLoadBalancerNodePorts: false` in OCP 4.12.

      Any deletion of a service with `allocateLoadBalancerNodePorts: false` fails and enters a 15-minute retry loop. If the service is recreated while the failed deletion is still being retried, the flows on br-ex are not recreated.

      Deletion will fail with:

      (...)
      obj_retry.go:257] Retry object setup: *factory.serviceForGateway <ns>/<service>
      obj_retry.go:290] Removing old object: *factory.serviceForGateway <ns>/<service> (failed: %!s(uint8=<retry>))
      (...)
      obj_retry.go:298] Retry delete failed for *factory.serviceForGateway <ns>/<service>, will try again later: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0
      

      While a deletion is still being retried, adding the service again fails with:

      obj_retry.go:476] Failed to delete old object <ns>/<service> of type *factory.serviceForGateway, during add event: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0
      

      ovnkube-node retries 15 times with a 1-minute backoff before it gives up. As long as these retries keep failing, the object cannot be recreated.

      That also means that there are currently 2 workarounds for this (tested):

      • restart all ovnkube-node pods --> this will get rid of the bad cache entries and recreate the br-ex flows
      • delete the service, wait more than 15 minutes (until the error messages about failed deletion and retries stop appearing), and then recreate the service

      The problem can easily be reproduced in 4.12. I tested this on 4.12.17 with SNO:

      $ cat fedora-test.yaml 
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: fedora-service
        labels:
          app: fedora-deployment
      spec:
        selector:
          app: fedora-pod
        ports:
          - protocol: TCP
            port: 80
            targetPort: 8080
        sessionAffinity: None
        type: LoadBalancer
        allocateLoadBalancerNodePorts: false
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: fedora-deployment
        labels:
          app: fedora-deployment
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: fedora-pod
        template:
          metadata:
            labels:
              app: fedora-pod
          spec:
            containers:
            - name: fedora-a
              image: registry.fedoraproject.org/fedora:latest
              imagePullPolicy: Always
              command:
              - sleep
              - infinity
            - name: fedora-b
              image: registry.fedoraproject.org/fedora:latest
              imagePullPolicy: Always
              command:
              - sleep
              - infinity
      
      oc apply -f fedora-test.yaml
      oc delete svc fedora-service
      oc apply -f fedora-test.yaml
      

      Logs:

      oc logs -n openshift-ovn-kubernetes ovnkube-node-4xg6w -c ovnkube-node -f | grep fedora-service
      (...)
      I0714 01:59:30.867309    9291 obj_retry.go:491] Creating *factory.serviceForGateway default/fedora-service took: 70.803µs
      I0714 01:59:30.875170    9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-5bmf8 took: 15.941µs
      I0714 01:59:30.875210    9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-5bmf8 took: 169ns
      E0714 01:59:52.496754    9291 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      I0714 02:00:02.969493    9291 obj_retry.go:471] Detected stale object during new object add of type *factory.serviceForGateway with the same key: default/fedora-service
      W0714 02:00:02.969523    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
      I0714 02:00:02.971917    9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-74vf8 took: 62.416µs
      I0714 02:00:02.971926    9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-74vf8 took: 255ns
      E0714 02:00:03.086557    9291 obj_retry.go:476] Failed to delete old object default/fedora-service of type *factory.serviceForGateway, during add event: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      I0714 02:00:13.982590    9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
      I0714 02:00:13.982621    9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
      I0714 02:00:14.104772    9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      I0714 02:00:22.397338    9291 obj_retry.go:571] Found retry entry for *factory.serviceForGateway default/fedora-service marked for deletion: will delete the object
      W0714 02:00:22.397400    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
      E0714 02:00:22.601603    9291 obj_retry.go:575] Failed to delete stale object default/fedora-service, during update: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      I0714 02:00:43.980921    9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
      I0714 02:00:43.980948    9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
      W0714 02:00:43.980976    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
      I0714 02:00:44.199215    9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      

      And the following watch shows that the flows are created initially, then upon deletion the flows vanish, then as the service is recreated the flows do not reappear:

      watch "ovs-ofctl dump-flows br-ex | grep 192.168.18.100"
      

      I can delete the ovnkube-node pod to recreate the flows:

      oc delete pod -n openshift-ovn-kubernetes ovnkube-node-4xg6w
      

      And the flows reappear:

      [root@sno ~]# ovs-ofctl dump-flows br-ex | grep 192.168.18.100 
       cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,arp,in_port=1,arp_tpa=192.168.18.100,arp_op=1 actions=LOCAL
       cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=1,nw_dst=192.168.18.100,tp_dst=80 actions=output:2
       cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=2,nw_src=192.168.18.100,tp_src=80 actions=output:1
      

      --------------------------------------

      The problem does not manifest in 4.13. The difference between 4.12 and 4.13 is that 4.12 is missing the backport of commit 0111e1faec20d16505a110449966273b430b7ad1.

      Log for service deletion in OCP 4.13:

      I0718 13:27:35.699982  334002 obj_retry.go:656] Delete event received for *factory.serviceForGateway default/fedora-service
      I0718 13:27:35.700010  334002 gateway_shared_intf.go:679] Deleting service fedora-service in namespace default
      I0718 13:27:35.769565  334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForGateway default/fedora-service-6hhds
      I0718 13:27:35.769596  334002 gateway_shared_intf.go:856] Deleting endpointslice fedora-service-6hhds in namespace default
      I0718 13:27:35.769610  334002 gateway_shared_intf.go:431] No serviceConfig found for service fedora-service in namespace default
      I0718 13:27:35.769618  334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-6hhds
      

      Log for service deletion in OCP 4.12:

      I0718 13:28:14.253695   52007 obj_retry.go:653] Delete event received for *factory.serviceForGateway default/fedora-service
      I0718 13:28:14.253717   52007 port_claim.go:197] Handle NodePort service fedora-service port 0
      I0718 13:28:14.253726   52007 gateway_shared_intf.go:649] Deleting service fedora-service in namespace default
      I0718 13:28:14.288844   52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForGateway default/fedora-service-2m857
      I0718 13:28:14.288870   52007 gateway_shared_intf.go:817] Deleting endpointslice fedora-service-2m857 in namespace default
      I0718 13:28:14.288876   52007 gateway_shared_intf.go:407] No serviceConfig found for service fedora-service in namespace default
      I0718 13:28:14.288881   52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-2m857
      E0718 13:28:14.402407   52007 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
      

      Both 4.12 and 4.13 have similar code, and `handleService` looks the same as well:

        func handleService(svc *kapi.Service, handler handler) []error {
            errors := []error{}
            if !util.ServiceTypeHasNodePort(svc) && len(svc.Spec.ExternalIPs) == 0 {
                return errors
            }

            for _, svcPort := range svc.Spec.Ports {
                if util.ServiceTypeHasNodePort(svc) {
                    klog.V(5).Infof("Handle NodePort service %s port %d", svc.Name, svcPort.NodePort)
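
      The "invalid port number: 0" error in the logs follows from this loop: with `allocateLoadBalancerNodePorts: false` the NodePort stays at 0 (see the "Handle NodePort service fedora-service port 0" line above), and the port claim code rejects it. The sketch below models that; `validatePort` and `releaseNodePorts` are hypothetical stand-ins written for this example, not the actual helpers in port_claim.go.

```go
package main

import "fmt"

// validatePort is a hypothetical stand-in for the check that produces the
// "invalid port number" message; the real helper may differ in detail.
func validatePort(port int32) error {
	if port <= 0 || port > 65535 {
		return fmt.Errorf("invalid port number: %d", port)
	}
	return nil
}

// releaseNodePorts models the per-port loop: every NodePort is validated
// before its claim is released, and each failure is collected as an error.
func releaseNodePorts(nodePorts []int32) []error {
	var errs []error
	for _, p := range nodePorts {
		if err := validatePort(p); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	// No NodePort was allocated, so the field holds its zero value.
	for _, err := range releaseNodePorts([]int32{0}) {
		fmt.Println("error:", err)
	}
	// An allocated NodePort passes validation and yields no errors.
	fmt.Println("errors for 30080:", len(releaseNodePorts([]int32{30080})))
}
```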
      

      But `ServiceTypeHasNodePort` in 4.13 takes `allocateLoadBalancerNodePorts` into account, whereas the 4.12 version does not:

      go-controller/pkg/util/kube.go

        func LoadBalancerServiceHasNodePortAllocation(service *kapi.Service) bool {
            return service.Spec.AllocateLoadBalancerNodePorts == nil || *service.Spec.AllocateLoadBalancerNodePorts
        }

        // ServiceTypeHasNodePort checks if the service has an associated NodePort or not
        func ServiceTypeHasNodePort(service *kapi.Service) bool {
            return service.Spec.Type == kapi.ServiceTypeNodePort ||
                (service.Spec.Type == kapi.ServiceTypeLoadBalancer && LoadBalancerServiceHasNodePortAllocation(service))
        }
      

      In OCP 4.12:

        // ServiceTypeHasNodePort checks if the service has an associated NodePort or not
        func ServiceTypeHasNodePort(service *kapi.Service) bool {
            return service.Spec.Type == kapi.ServiceTypeNodePort || service.Spec.Type == kapi.ServiceTypeLoadBalancer
        }
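
      To make the difference concrete, here is a minimal, self-contained sketch that runs both variants against a service like the one in the reproducer. The `Service` and `ServiceType` types are simplified stand-ins for the `kapi` types, and `hasNodePort412` / `hasNodePort413` are names made up for this comparison; only the boolean logic is taken from the snippets above.

```go
package main

import "fmt"

// ServiceType and Service are minimal stand-ins for the Kubernetes API
// types; they are assumptions made for this sketch.
type ServiceType string

const (
	ServiceTypeNodePort     ServiceType = "NodePort"
	ServiceTypeLoadBalancer ServiceType = "LoadBalancer"
)

type Service struct {
	Type                          ServiceType
	AllocateLoadBalancerNodePorts *bool
}

// hasNodePort412 mirrors the 4.12 check: every LoadBalancer service is
// treated as having a NodePort.
func hasNodePort412(s Service) bool {
	return s.Type == ServiceTypeNodePort || s.Type == ServiceTypeLoadBalancer
}

// hasNodePort413 mirrors the 4.13 check: a LoadBalancer service only counts
// if NodePort allocation is unset (nil) or explicitly true.
func hasNodePort413(s Service) bool {
	allocates := s.AllocateLoadBalancerNodePorts == nil || *s.AllocateLoadBalancerNodePorts
	return s.Type == ServiceTypeNodePort ||
		(s.Type == ServiceTypeLoadBalancer && allocates)
}

func main() {
	disabled := false
	svc := Service{Type: ServiceTypeLoadBalancer, AllocateLoadBalancerNodePorts: &disabled}

	fmt.Println("4.12:", hasNodePort412(svc)) // prints "4.12: true"
	fmt.Println("4.13:", hasNodePort413(svc)) // prints "4.13: false"
}
```

      With the 4.12 logic the service is still routed into the NodePort handling path, where NodePort 0 is rejected; with the 4.13 logic that path is skipped.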
      

            Andreas Karis (akaris@redhat.com)
            Arti Sood