OpenShift Bugs / OCPBUGS-54667

[UDN] ovnkube-controller in CrashLoopBackOff as soon as any node in the OCP cluster is rebooted, leaving the node NotReady

      Description of problem:

When a UDN object exists and any node in the OCP cluster is rebooted, that node becomes NotReady.

      Version-Release number of selected component (if applicable): 4.18.6

How reproducible: Every time

      Steps to Reproduce:

{code}
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.6    True        False         2d21h   Cluster version is 4.18.6
{code}

=> Created a namespace called udn1 carrying the primary-UDN label (note it is a label, not an annotation):

{code:yaml}
      $ cat udn1-ns.yaml 
      apiVersion: v1
      kind: Namespace
      metadata:
        name: udn1
        labels:
          k8s.ovn.org/primary-user-defined-network: ""
{code}
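
To confirm the label landed on the namespace (for a primary UDN the label generally has to be present when the namespace is created and cannot be added later), something like the following can be used; the jsonpath expression is just one way to print the labels:

{code:bash}
$ oc get namespace udn1 -o jsonpath='{.metadata.labels}{"\n"}'
{code}
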
=> Created a Layer2 UDN using the YAML below:

{code:yaml}
$ cat udn1-l2.yaml
      apiVersion: k8s.ovn.org/v1
      kind: UserDefinedNetwork
      metadata:
        name: udn-1-l2 
        namespace: udn1
      spec:
        topology: Layer2 
        layer2: 
          role: Primary 
          subnets:
            - "192.168.123.0/24"
{code}
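
Once the UDN is applied, its status and the NetworkAttachmentDefinition that OVN-Kubernetes renders from it can be checked; a quick sketch (exact columns vary by version):

{code:bash}
$ oc get userdefinednetwork -n udn1
$ oc get net-attach-def -n udn1
{code}
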
=> Created privileged pods using the following Deployment:

{code:yaml}
$ cat dep.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ub24-00
        namespace: udn1
      spec:
        selector:
          matchLabels:
            app: ub24-00
        replicas: 2
        template:
          metadata:
            labels:
              app: ub24-00
          spec:
            serviceAccountName: demo
            securityContext:
              runAsUser: 0   # "privileged" is not a valid pod-level securityContext field; it is set per container below
            containers:
              - name: ub24-00
                image: ghcr.io/rameshsahoo111/ub24:latest
                command: ["/bin/sleep"]
                args: ["infinity"]
                imagePullPolicy: Always
                securityContext:
                  privileged: true
                  runAsUser: 0
                  capabilities:
                    add:
                      - NET_ADMIN
                      - NET_RAW
{code}

Before creating the deployment, create the service account and grant it the privileged SCC:

{code:bash}
$ oc create sa demo -n udn1
$ oc adm policy add-scc-to-user privileged -z demo -n udn1
{code}
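
After the deployment is up, one way to confirm the pods actually attached to the primary UDN is to inspect the k8s.ovn.org/pod-networks annotation; this is a sketch, and the annotation contents are version-dependent:

{code:bash}
$ oc get pods -n udn1 -o wide
$ oc get pods -n udn1 -l app=ub24-00 \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations.k8s\.ovn\.org/pod-networks}{"\n"}{end}'
{code}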

       

=> Reboot any node in the cluster.
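
One way to trigger the reboot, assuming worker-2 is the target node (any node reproduces it per the report):

{code:bash}
$ oc debug node/worker-2 -- chroot /host systemctl reboot
{code}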

Actual results: The rebooted node becomes NotReady.
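
The node status can be watched from another terminal while the node comes back up:

{code:bash}
$ oc get nodes -w
{code}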

-> The ovnkube-node pod on the rebooted node (worker-2) is stuck at 7/8 containers ready:

{code}
$ oc get po -o wide
      NAME                                    READY   STATUS             RESTARTS        AGE     IP               NODE       NOMINATED NODE   READINESS GATES
      ovnkube-control-plane-86567b4c5-57x78   2/2     Running            0               5d22h   192.168.100.7    master-1   <none>           <none>
      ovnkube-control-plane-86567b4c5-dmtgc   2/2     Running            0               5d22h   192.168.100.8    master-2   <none>           <none>
      ovnkube-node-bmxcc                      8/8     Running            9 (5d21h ago)   5d22h   192.168.100.9    worker-0   <none>           <none>
      ovnkube-node-dgf49                      8/8     Running            8               5d22h   192.168.100.6    master-0   <none>           <none>
      ovnkube-node-l79p4                      8/8     Running            9 (5d22h ago)   5d22h   192.168.100.8    master-2   <none>           <none>
      ovnkube-node-mdtxz                      7/8     CrashLoopBackOff   22 (69s ago)    18h     192.168.100.11   worker-2   <none>           <none>
      ovnkube-node-p8l8x                      8/8     Running            9               18h     192.168.100.10   worker-1   <none>           <none>
      ovnkube-node-s645g                      8/8     Running            9 (5d22h ago)   5d22h   192.168.100.7    master-1   <none>           <none> 
{code}
-> Upon checking, the ovnkube-controller container is the one in CrashLoopBackOff (from the pod's containerStatuses):

{code:yaml}
        - containerID: cri-o://578955777d13e775142cf38f7228a0dffa2a1c6fcd3110f6940f580e8e7f21ad
          image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5adc4825487255b6d29defb8cebbde1cd7a701b90fa78c4cb6e8b0230458a940
          imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5adc4825487255b6d29defb8cebbde1cd7a701b90fa78c4cb6e8b0230458a940
          lastState:
            terminated:
              containerID: cri-o://578955777d13e775142cf38f7228a0dffa2a1c6fcd3110f6940f580e8e7f21ad
              exitCode: 1
              finishedAt: "2025-04-07T12:34:51Z"
              message: |-
                == 192.168.123.0/24 && ip4.dst == 192.168.123.0/24 Nexthop:<nil> Nexthops:[] Options:map[] Priority:102} on router GR_udn1_udn.1.l2_worker-2: object not found
                E0407 12:34:51.251115   21787 factory.go:1320] Failed (will retry) while processing existing *v1.Node items: failed to initialize networks cluster logical router egress policies for network udn1_udn-1-l2: failed to create no reroute policies for pods on network udn1_udn-1-l2: unable to create IPv4 no-reroute pod policies, err: error creating logical router policy {UUID:u2351433412 Action:allow BFDSessions:[] ExternalIDs:map[ip-family:ip4 k8s.ovn.org/id:default-network-controller:EgressIP:102:EIP-No-Reroute-Pod-To-Pod:ip4:udn1_udn-1-l2 k8s.ovn.org/name:EIP-No-Reroute-Pod-To-Pod k8s.ovn.org/owner-controller:default-network-controller k8s.ovn.org/owner-type:EgressIP network:udn1_udn-1-l2 priority:102] Match:ip4.src == 192.168.123.0/24 && ip4.dst == 192.168.123.0/24 Nexthop:<nil> Nexthops:[] Options:map[] Priority:102} on router GR_udn1_udn.1.l2_worker-2: object not found
                I0407 12:34:51.252043   21787 factory.go:656] Stopping watch factory
                I0407 12:34:51.252120   21787 ovnkube.go:599] Stopped ovnkube
                I0407 12:34:51.252239   21787 metrics.go:553] Stopping metrics server at address "127.0.0.1:29103"
                F0407 12:34:51.252510   21787 ovnkube.go:137] failed to run ovnkube: [failed to start network c
              reason: Error
              startedAt: "2025-04-07T12:33:49Z"
          name: ovnkube-controller
          ready: false
          restartCount: 8
          started: false
          state:
            waiting:
              message: back-off 5m0s restarting failed container=ovnkube-controller pod=ovnkube-node-mdtxz_openshift-ovn-kubernetes(73dc8d00-edc5-48f8-a82a-1ec081112578)
              reason: CrashLoopBackOff
{code}
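
The full failure message is easier to read from the previous run of the crashing container; -c selects the container and --previous pulls the logs from its last (failed) instance:

{code:bash}
$ oc logs -n openshift-ovn-kubernetes ovnkube-node-mdtxz -c ovnkube-controller --previous
{code}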

Expected results: The rebooted node should rejoin the cluster and become Ready.

       

      Additional info:

Attaching the network must-gather to this Jira.
