Type: Bug
Resolution: Done-Errata
Priority: Major
Affects Version: 4.18
Hi team,
Restarting an ovnkube-node pod leaves the ovnkube-controller container in a CrashLoopBackOff/Error state after a CUDN resource is deleted, when a service with no endpoints still exists in the namespace that the deleted CUDN resource targeted.
Steps to reproduce:
- Create a namespace using a primary UDN network:
apiVersion: v1
kind: Namespace
metadata:
  name: udn3c
  labels:
    k8s.ovn.org/primary-user-defined-network: ""
- Create a CUDN resource targeting the above namespace:
apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
  name: cudnpl3
spec:
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - udn3c
  network:
    topology: Layer3
    layer3:
      role: Primary
      subnets:
        - cidr: "20.100.0.0/16"
          hostSubnet: 24
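For completeness, the two manifests above can be applied and verified along these lines (file names are illustrative, not from the original report):
$ oc apply -f udn3c-namespace.yaml
$ oc apply -f cudnpl3-cudn.yaml
$ oc get clusteruserdefinednetworks.k8s.ovn.org cudnpl3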
- Create a pod in the `udn3c` namespace:
$ oc run pod1 --image gcr.io/google-samples/kubernetes-bootcamp:v1 -n udn3c
$ oc get pod -n udn3c
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          66m
- Create a NodePort service to expose the pod:
$ oc expose pod pod1 --port 8080 --target-port 8080 --type NodePort -n udn3c
$ oc get svc -n udn3c
NAME   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
pod1   NodePort   172.30.147.164   <none>        8080:30173/TCP   64m
- Now delete pod `pod1` in the `udn3c` namespace, leaving the service with no endpoints:
$ oc delete pod pod1 -n udn3c
pod "pod1" deleted
$ oc get ep -n udn3c
NAME ENDPOINTS AGE
pod1 <none> 65m
- Now delete the CUDN resource targeting the `udn3c` namespace:
$ oc delete clusteruserdefinednetworks.k8s.ovn.org cudnpl3
clusteruserdefinednetwork.k8s.ovn.org "cudnpl3" deleted
- Restart an ovnkube-node pod; it ends up in an Error/CrashLoopBackOff state:
$ oc delete pod ovnkube-node-t9lvv -n openshift-ovn-kubernetes
pod "ovnkube-node-t9lvv" deleted
$ oc get pods -n openshift-ovn-kubernetes | grep -i ovnkube-node-wstpk
ovnkube-node-wstpk   7/8   Error   1 (64s ago)   2m12s
$ oc logs ovnkube-node-wstpk -n openshift-ovn-kubernetes -c ovnkube-controller | tail -n 3
I0417 07:01:50.392957 1452719 ovnkube.go:599] Stopped ovnkube
I0417 07:01:50.393000 1452719 metrics.go:553] Stopping metrics server at address "127.0.0.1:29103"
F0417 07:01:50.393122 1452719 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: error waiting for node readiness: gateway init failed to start watching services: watchResource for resource *factory.serviceForGateway. Failed addHandlerFunc: context deadline exceeded
$ oc logs ovnkube-node-wstpk -n openshift-ovn-kubernetes -c ovnkube-controller | grep -i fail | tail -n 3
E0417 07:03:08.844670 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
E0417 07:03:09.352426 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
E0417 07:03:09.840163 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
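For context, the error suggests the stale service is what trips the gateway service sync: after the CUDN is deleted, the k8s.ovn.org/primary-user-defined-network label is still on the namespace but no UDN or NAD resource remains, so the "both must be present" check fails for every service left in that namespace. The state can be cross-checked with standard queries (a diagnostic sketch, not part of the original reproducer):
$ oc get ns udn3c -o jsonpath='{.metadata.labels}{"\n"}'
$ oc get userdefinednetworks.k8s.ovn.org -n udn3c
$ oc get net-attach-def -n udn3c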
- To resolve the issue, delete the service with no endpoints in the `udn3c` namespace:
$ oc delete svc pod1 -n udn3c
service "pod1" deleted
$ oc get pods -n openshift-ovn-kubernetes | grep -i ovnkube-node-wstpk
ovnkube-node-wstpk   8/8   Running   4 (75s ago)   6m10s   --> the ovnkube-controller container now comes up
Expected Result:
The ovnkube-node pod should not end up in an Error/CrashLoopBackOff state when restarted under these conditions, since this can impact customers performing upgrades or other maintenance activity.
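Until a fixed build is deployed, a pre-maintenance check along the following lines could flag namespaces at risk, i.e. UDN-labeled namespaces that still contain services with no endpoints (a sketch assuming jq is available, not an official workaround):
$ for ns in $(oc get ns -l k8s.ovn.org/primary-user-defined-network -o jsonpath='{.items[*].metadata.name}'); do
    # print "<namespace>/<service>" for every Endpoints object with no subsets
    oc get endpoints -n "$ns" -o json \
      | jq -r --arg ns "$ns" '.items[] | select((.subsets // []) | length == 0) | "\($ns)/\(.metadata.name)"'
  done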
Regards,
Manish
links to:
- RHBA-2025:10767 OpenShift Container Platform 4.18.20 bug fix update