OCPBUGS-57318

[4.18] ovnkube-node pod restart results in CrashLoopBackOff/Error state after deleting the CUDN resource


      * Before this update, deleting a cluster user-defined network (CUDN) resource in a namespace with an existing, endpoint-less Service caused the `ovnkube-node` pod restart to fail. With this release, the `ovnkube-node` pod restarts successfully after you delete CUDN resources with existing, endpoint-less Services in the targeted namespace. (link:https://issues.redhat.com/browse/OCPBUGS-57318[OCPBUGS-57318])

      Hi team,

      Restarting an ovnkube-node pod results in a CrashLoopBackOff/Error state for the ovnkube-controller container after deleting a CUDN resource, when an existing svc (with no endpoints) remains in the namespace targeted by the deleted CUDN resource.
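
      As a quick way to spot the triggering condition, list the Endpoints objects in any namespace targeted by the deleted CUDN and look for entries showing `<none>` (the namespace `udn3c` is simply the one used in the reproduction below):

      $ oc get endpoints -n udn3c | grep '<none>'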

      Steps to reproduce:

      • Create a namespace using a primary UDN network:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: udn3c
        labels:
          k8s.ovn.org/primary-user-defined-network: ""
      
      • Create a CUDN resource targeting the above namespace:
      apiVersion: k8s.ovn.org/v1
      kind: ClusterUserDefinedNetwork
      metadata:
        name: cudnpl3
      spec:
        namespaceSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: In
              values:
                - udn3c
        network: 
          topology: Layer3
          layer3: 
            role: Primary 
            subnets:
              - cidr: "20.100.0.0/16"
                hostSubnet: 24
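
      Optionally, confirm the CUDN was accepted and that a NetworkAttachmentDefinition has been rendered in the selected namespace once the controller reconciles it (assuming the standard `net-attach-def` short name for `network-attachment-definitions`):

      $ oc get clusteruserdefinednetworks.k8s.ovn.org cudnpl3
      $ oc get net-attach-def -n udn3c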
      
      • Create a pod in udn3c namespace:
      $ oc run pod1 --image gcr.io/google-samples/kubernetes-bootcamp:v1 -n udn3c
      
       $ oc get pod -n udn3c
      NAME   READY   STATUS    RESTARTS   AGE
      pod1   1/1     Running   0          66m
      
      
      • Create a NodePort svc to expose the pod
       $ oc expose pod pod1 --port 8080 --target-port 8080 --type NodePort -n udn3c
       
       $ oc get svc -n udn3c
      NAME   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
      pod1   NodePort   172.30.147.164   <none>        8080:30173/TCP   64m
      
      
      • Now delete the pod `pod1` in the `udn3c` namespace
      $ oc delete pod pod1 -n udn3c
      pod "pod1" deleted
      
      $ oc get ep -n udn3c
      NAME   ENDPOINTS   AGE
      pod1   <none>      65m
      
      • Now delete the CUDN resource targeting the udn3c namespace
      $ oc delete clusteruserdefinednetworks.k8s.ovn.org cudnpl3
      clusteruserdefinednetwork.k8s.ovn.org "cudnpl3" deleted
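
      At this point the namespace still carries the `k8s.ovn.org/primary-user-defined-network` label even though the CUDN (and any NetworkAttachmentDefinition rendered for it) is gone, which matches the inconsistent state reported by the errors further below. A quick way to confirm this (again assuming the `net-attach-def` short name) is:

      $ oc get namespace udn3c --show-labels
      $ oc get net-attach-def -n udn3c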
      
      • Restart an ovnkube-node pod; it ends up in an Error/CrashLoopBackOff state:
      $ oc delete pod ovnkube-node-t9lvv -n openshift-ovn-kubernetes
      pod "ovnkube-node-t9lvv" deleted
      
      $ oc get pods -n openshift-ovn-kubernetes|grep -i ovnkube-node-wstpk 
      ovnkube-node-wstpk                       7/8     Error     1 (64s ago)   2m12s
      
      $ oc logs ovnkube-node-wstpk -n openshift-ovn-kubernetes -c ovnkube-controller|tail -n 3
      I0417 07:01:50.392957 1452719 ovnkube.go:599] Stopped ovnkube
      I0417 07:01:50.393000 1452719 metrics.go:553] Stopping metrics server at address "127.0.0.1:29103"
      F0417 07:01:50.393122 1452719 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: error waiting for node readiness: gateway init failed to start watching services: watchResource for resource *factory.serviceForGateway. Failed addHandlerFunc: context deadline exceeded
      
      
      $ oc logs ovnkube-node-wstpk -n openshift-ovn-kubernetes -c ovnkube-controller|grep -i fail|tail -n3
      E0417 07:03:08.844670 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
      E0417 07:03:09.352426 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
      E0417 07:03:09.840163 1454897 factory.go:1320] Failed (will retry) while processing existing *v1.Service items: gateway sync services failed: invalid primary network state for namespace "udn3c": a valid primary user defined network or network attachment definition custom resource, and required namespace label "k8s.ovn.org/primary-user-defined-network" must both be present
      
      
      
      • To resolve the issue, delete the svc with no endpoints in the udn3c namespace:
      $ oc delete svc pod1 -n udn3c
      service "pod1" deleted
      
      
      $ oc get pods -n openshift-ovn-kubernetes|grep -i ovnkube-node-wstpk 
      ovnkube-node-wstpk                       8/8     Running   4 (75s ago)   6m10s   --> Now the ovnkube-controller container comes up
      

      Expected Result:
      The ovnkube-node pod should not end up in an Error/CrashLoopBackOff state when restarted under these conditions, since this can impact customers performing upgrades or other maintenance activity.

      Regards,
      Manish
