Bug
Resolution: Done-Errata
4.13.z
TL;DR
4.12 requires backport of commit:
commit 0111e1faec20d16505a110449966273b430b7ad1
Author: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Date:   Tue Sep 6 21:20:57 2022 +0200

    Support AllocateLoadBalancerNodePortsFalse

    This PR supports having allocateloadbalancernodeports
    set to false along with etp=local on lgw mode.

    Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Analysis
The backport of LoadBalancerServiceHasNodePortAllocation is missing from 4.12. This breaks flow creation for services with `allocateLoadBalancerNodePorts: false` in OCP 4.12, even in shared gateway mode.
Any deletion of such a service fails and goes into a 15-minute retry loop. If the service is recreated while the failed deletion is still being retried, the flows on br-ex are not recreated.
Deletion will fail with:
(...)
obj_retry.go:257] Retry object setup: *factory.serviceForGateway <ns>/<service>
obj_retry.go:290] Removing old object: *factory.serviceForGateway <ns>/<service> (failed: %!s(uint8=<retry>))
(...)
obj_retry.go:298] Retry delete failed for *factory.serviceForGateway <ns>/<service>, will try again later: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0
While a deletion is still being retried, an add fails with:
obj_retry.go:476] Failed to delete old object <ns>/<service> of type *factory.serviceForGateway, during add event: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0
ovnkube-node retries 15 times with a 1-minute backoff before it gives up. As long as the deletion keeps failing, the object cannot be recreated.
That also means that there are currently two workarounds for this (both tested):
- restart all ovnkube-node pods --> this gets rid of the bad cache entries and recreates the br-ex flows
- delete the service, wait more than 15 minutes (until the error messages about failed deletion and retries no longer appear), and then recreate the service
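The interaction between the failing deletion and a later add can be illustrated with a small model. This is only a sketch of the behaviour described above, not ovn-kubernetes code: `retryEntry`, `deleteOld`, `tryAdd`, and `retryTick` are hypothetical names, the 1-minute backoff is left out, and the retry budget of 15 is taken from the description in this report.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRetries is the retry budget described in this report.
const maxRetries = 15

// retryEntry is a hypothetical model of one cached retry record.
type retryEntry struct {
	failedAttempts int
	pendingDelete  bool
}

// deleteOld always fails, like the port-claim removal on 4.12.
func deleteOld() error {
	return errors.New("invalid port number: 0")
}

// tryAdd models an add event: a stale object must be deleted first.
func tryAdd(e *retryEntry) error {
	if e.pendingDelete {
		if err := deleteOld(); err != nil {
			return fmt.Errorf("failed to delete old object during add: %w", err)
		}
		e.pendingDelete = false
	}
	return nil
}

// retryTick models one periodic retry pass. It returns true once the
// entry is resolved or the retry budget is exhausted.
func retryTick(e *retryEntry) bool {
	if !e.pendingDelete {
		return true
	}
	if err := deleteOld(); err != nil {
		e.failedAttempts++
		if e.failedAttempts >= maxRetries {
			e.pendingDelete = false // give up and drop the entry
			return true
		}
		return false
	}
	e.pendingDelete = false
	return true
}

func main() {
	e := &retryEntry{pendingDelete: true}
	fmt.Println("add while delete is pending:", tryAdd(e))
	for !retryTick(e) {
		// the real retry waits between attempts; omitted here
	}
	fmt.Println("failed attempts before giving up:", e.failedAttempts)
	fmt.Println("add after giving up:", tryAdd(e))
}
```

In this model the add only succeeds once the stale entry is gone, which matches the second workaround. Restarting ovnkube-node corresponds to starting with an empty cache.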
The problem can easily be reproduced in 4.12. I tested this on 4.12.17 with SNO:
$ cat fedora-test.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: fedora-service
  labels:
    app: fedora-deployment
spec:
  selector:
    app: fedora-pod
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  sessionAffinity: None
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fedora-deployment
  labels:
    app: fedora-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fedora-pod
  template:
    metadata:
      labels:
        app: fedora-pod
    spec:
      containers:
        - name: fedora-a
          image: registry.fedoraproject.org/fedora:latest
          imagePullPolicy: Always
          command:
            - sleep
            - infinity
        - name: fedora-b
          image: registry.fedoraproject.org/fedora:latest
          imagePullPolicy: Always
          command:
            - sleep
            - infinity
$ oc apply -f fedora-test.yaml
$ oc delete svc fedora-service
$ oc apply -f fedora-test.yaml
Logs:
$ oc logs -n openshift-ovn-kubernetes ovnkube-node-4xg6w -c ovnkube-node -f | grep fedora-service
(...)
I0714 01:59:30.867309 9291 obj_retry.go:491] Creating *factory.serviceForGateway default/fedora-service took: 70.803µs
I0714 01:59:30.875170 9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-5bmf8 took: 15.941µs
I0714 01:59:30.875210 9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-5bmf8 took: 169ns
E0714 01:59:52.496754 9291 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:02.969493 9291 obj_retry.go:471] Detected stale object during new object add of type *factory.serviceForGateway with the same key: default/fedora-service
W0714 02:00:02.969523 9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
I0714 02:00:02.971917 9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-74vf8 took: 62.416µs
I0714 02:00:02.971926 9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-74vf8 took: 255ns
E0714 02:00:03.086557 9291 obj_retry.go:476] Failed to delete old object default/fedora-service of type *factory.serviceForGateway, during add event: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:13.982590 9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
I0714 02:00:13.982621 9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
I0714 02:00:14.104772 9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:22.397338 9291 obj_retry.go:571] Found retry entry for *factory.serviceForGateway default/fedora-service marked for deletion: will delete the object
W0714 02:00:22.397400 9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
E0714 02:00:22.601603 9291 obj_retry.go:575] Failed to delete stale object default/fedora-service, during update: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:43.980921 9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
I0714 02:00:43.980948 9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
W0714 02:00:43.980976 9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
I0714 02:00:44.199215 9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
The following watch shows that the flows are created initially, vanish when the service is deleted, and do not reappear when the service is recreated:
watch "ovs-ofctl dump-flows br-ex | grep 192.168.18.100"
I can delete the ovnkube-node pod to recreate the flows:
oc delete pod -n openshift-ovn-kubernetes ovnkube-node-4xg6w
And the flows reappear:
[root@sno ~]# ovs-ofctl dump-flows br-ex | grep 192.168.18.100
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,arp,in_port=1,arp_tpa=192.168.18.100,arp_op=1 actions=LOCAL
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=1,nw_dst=192.168.18.100,tp_dst=80 actions=output:2
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=2,nw_src=192.168.18.100,tp_src=80 actions=output:1
--------------------------------------
The problem does not manifest in 4.13. The difference between 4.12 and 4.13 is that commit 0111e1faec20d16505a110449966273b430b7ad1 was never backported to 4.12.
Log for service deletion in OCP 4.13:
I0718 13:27:35.699982 334002 obj_retry.go:656] Delete event received for *factory.serviceForGateway default/fedora-service
I0718 13:27:35.700010 334002 gateway_shared_intf.go:679] Deleting service fedora-service in namespace default
I0718 13:27:35.769565 334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForGateway default/fedora-service-6hhds
I0718 13:27:35.769596 334002 gateway_shared_intf.go:856] Deleting endpointslice fedora-service-6hhds in namespace default
I0718 13:27:35.769610 334002 gateway_shared_intf.go:431] No serviceConfig found for service fedora-service in namespace default
I0718 13:27:35.769618 334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-6hhds
Log for service deletion in OCP 4.12:
I0718 13:28:14.253695 52007 obj_retry.go:653] Delete event received for *factory.serviceForGateway default/fedora-service
I0718 13:28:14.253717 52007 port_claim.go:197] Handle NodePort service fedora-service port 0
I0718 13:28:14.253726 52007 gateway_shared_intf.go:649] Deleting service fedora-service in namespace default
I0718 13:28:14.288844 52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForGateway default/fedora-service-2m857
I0718 13:28:14.288870 52007 gateway_shared_intf.go:817] Deleting endpointslice fedora-service-2m857 in namespace default
I0718 13:28:14.288876 52007 gateway_shared_intf.go:407] No serviceConfig found for service fedora-service in namespace default
I0718 13:28:14.288881 52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-2m857
E0718 13:28:14.402407 52007 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
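The `port 0` in the 4.12 log is the service port's NodePort field, which stays at zero when no node port is allocated. The following is a minimal sketch of why releasing a claim on that value fails. `ServicePort` is a simplified stand-in for the Kubernetes type, and `validatePort` and `releaseNodePorts` are hypothetical helpers whose error text merely mirrors the log. The actual validation in ovn-kubernetes may differ.

```go
package main

import "fmt"

// ServicePort is a simplified stand-in for kapi.ServicePort.
type ServicePort struct {
	Port     int32
	NodePort int32 // 0 when no node port was allocated
}

// validatePort is a hypothetical check; 0 is outside the valid port range.
func validatePort(port int32) error {
	if port < 1 || port > 65535 {
		return fmt.Errorf("invalid port number: %d", port)
	}
	return nil
}

// releaseNodePorts releases node port claims only when the service is
// considered to have node ports. treatAsNodePort stands for the result of
// a ServiceTypeHasNodePort-style check.
func releaseNodePorts(ports []ServicePort, treatAsNodePort bool) error {
	if !treatAsNodePort {
		return nil
	}
	for _, p := range ports {
		if err := validatePort(p.NodePort); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Service port 80 with allocateLoadBalancerNodePorts: false.
	ports := []ServicePort{{Port: 80}}
	fmt.Println(releaseNodePorts(ports, true))  // service treated as NodePort
	fmt.Println(releaseNodePorts(ports, false)) // service not treated as NodePort
}
```

When the service is wrongly treated as having a node port, the zero value reaches the validation and the deletion errors out. When it is not, the node port path is skipped entirely.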
Both 4.12 and 4.13 have similar code, and `handleService` looks the same as well:
func handleService(svc *kapi.Service, handler handler) []error {
	errors := []error{}
	if !util.ServiceTypeHasNodePort(svc) && len(svc.Spec.ExternalIPs) == 0 {
		return errors
	}

	for _, svcPort := range svc.Spec.Ports {
		if util.ServiceTypeHasNodePort(svc) {
			klog.V(5).Infof("Handle NodePort service %s port %d", svc.Name, svcPort.NodePort)
But ServiceTypeHasNodePort in 4.13 takes allocateLoadBalancerNodePorts into account for LoadBalancer services, whereas 4.12 does not:
go-controller/pkg/util/kube.go
func LoadBalancerServiceHasNodePortAllocation(service *kapi.Service) bool {
	return service.Spec.AllocateLoadBalancerNodePorts == nil || *service.Spec.AllocateLoadBalancerNodePorts
}

// ServiceTypeHasNodePort checks if the service has an associated NodePort or not
func ServiceTypeHasNodePort(service *kapi.Service) bool {
	return service.Spec.Type == kapi.ServiceTypeNodePort ||
		(service.Spec.Type == kapi.ServiceTypeLoadBalancer && LoadBalancerServiceHasNodePortAllocation(service))
}
In OCP 4.12:
// ServiceTypeHasNodePort checks if the service has an associated NodePort or not
func ServiceTypeHasNodePort(service *kapi.Service) bool {
	return service.Spec.Type == kapi.ServiceTypeNodePort || service.Spec.Type == kapi.ServiceTypeLoadBalancer
}
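To make the difference concrete, here is a self-contained sketch that evaluates both variants of the check for a LoadBalancer service with `allocateLoadBalancerNodePorts: false`. `Service` and `ServiceType` are simplified stand-ins for the Kubernetes API types, and `hasNodePort412` and `hasNodePort413` are illustrative names for the two versions of `ServiceTypeHasNodePort` quoted above.

```go
package main

import "fmt"

// ServiceType and Service are simplified stand-ins for the Kubernetes API
// types; they are not the real kapi definitions.
type ServiceType string

const (
	ServiceTypeNodePort     ServiceType = "NodePort"
	ServiceTypeLoadBalancer ServiceType = "LoadBalancer"
)

type Service struct {
	Type                          ServiceType
	AllocateLoadBalancerNodePorts *bool // nil means node ports are allocated
}

// hasNodePort412 mirrors the 4.12 check: every LoadBalancer service is
// assumed to have a node port.
func hasNodePort412(svc *Service) bool {
	return svc.Type == ServiceTypeNodePort || svc.Type == ServiceTypeLoadBalancer
}

// hasNodePort413 mirrors the 4.13 check: a LoadBalancer service only counts
// if node port allocation is not disabled.
func hasNodePort413(svc *Service) bool {
	allocates := svc.AllocateLoadBalancerNodePorts == nil || *svc.AllocateLoadBalancerNodePorts
	return svc.Type == ServiceTypeNodePort ||
		(svc.Type == ServiceTypeLoadBalancer && allocates)
}

func main() {
	disabled := false
	svc := &Service{Type: ServiceTypeLoadBalancer, AllocateLoadBalancerNodePorts: &disabled}
	fmt.Println("4.12:", hasNodePort412(svc)) // prints "4.12: true"
	fmt.Println("4.13:", hasNodePort413(svc)) // prints "4.13: false"
}
```

With the 4.12 variant the service is routed through the node port handling even though no node port exists, which is where the port 0 in the failing deletion comes from.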
- clones: OCPBUGS-16336 Missing BP of LoadBalancerServiceHasNodePortAllocation into 4.12 causes problems with flow creation for these services (Closed)
- links to: RHBA-2023:4440 OpenShift Container Platform 4.12.z bug fix update
- links to: RHBA-2023:5011 OpenShift Container Platform 4.13.z bug fix update